OUTCOME PREDICTION AND RISK CLASSIFICATION IN CHILDHOOD LEUKEMIA

This application claims the benefit of U.S. Provisional Applications
Serial Nos. 60/432,064; 60/432,077; and 60/432,078; all of which were filed
December 6, 2002; and U.S. Provisional Applications Serial Nos. 60/510,904
and 60/510,968, both of which were filed October 14, 2003; and a U.S.
Provisional Application entitled "Outcome Prediction in Childhood Leukemia"
filed on even date herewith. These provisional applications are incorporated
herein by reference in their entireties.

STATEMENT OF GOVERNMENT RIGHTS 
This invention was made with government support under a grant from
the National Institutes of Health (National Cancer Institute), Grant No. NIH
NCI U01 CA88361; and under a contract from the Department of Energy,
Contract No. DE-AC04-94AL85000. The U.S. Government has certain rights
in this invention.

BACKGROUND OF THE INVENTION

Leukemia is the most common childhood malignancy in the United
States. Approximately 3,500 cases of acute leukemia are diagnosed each year
in the U.S. in children less than 20 years of age. The large majority (>70%) of
these cases are acute lymphoblastic leukemias (ALL) and the remainder acute
myeloid leukemias (AML). The outcome for children with ALL has improved
dramatically over the past three decades, but despite significant progress in
treatment, 25% of children with ALL develop recurrent disease. Conversely,
another 25% of children who now receive dose intensification are likely
"over-treated" and may well be cured using less intensive regimens resulting in
fewer toxicities and long term side effects. Thus, a major challenge for the
treatment of children with ALL in the next decade is to improve and refine ALL
diagnosis and risk classification schemes in order to precisely tailor
therapeutic approaches to the biology of the tumor and the genotype of the host.

Leukemia in the first 12 months of life (referred to as infant leukemia) is
extremely rare in the United States, with about 150 infants diagnosed each year.
There are several clinical and genetic factors that distinguish infant leukemia
from acute leukemias that occur in older children. First, while acute
lymphoblastic leukemia (ALL) is far more frequent (approximately five times)
than acute myeloid leukemia (AML) in children from ages 1-15 years, the
frequency of ALL and AML in infants less than one year of age is approximately
equivalent. Secondly, in contrast to the extensive heterogeneity in cytogenetic
abnormalities and chromosomal rearrangements in older children with ALL and
AML, nearly 60% of acute leukemias in infants have chromosomal rearrangements
involving the MLL gene (for Mixed Lineage Leukemia) on chromosome 11q23. MLL
translocations characterize a subset of human acute leukemias with a decidedly
unfavorable prognosis. Current estimates suggest that about 60% of infants with
AML and about 80% of infants with ALL have a chromosomal rearrangement
involving MLL in their leukemia cells. Whether hematopoietic cells in infants
are more likely to undergo chromosomal rearrangements involving 11q23 or
whether this 11q23 rearrangement reflects a unique environmental exposure or
genetic susceptibility remains to be determined.

The modern classification of acute leukemias in children and adults
relies on morphologic and cytochemical features that may be useful in
distinguishing AML from ALL, changes in the expression of cell surface
antigens as a precursor cell differentiates, and the presence of specific recurrent
cytogenetic or chromosomal rearrangements in leukemic cells. Using
monoclonal antibodies, cell surface antigens (called clusters of differentiation
(CD)) can be identified in cell populations; leukemias can be accurately
classified by this means (immunophenotyping). By immunophenotyping, it is
possible to classify ALL into the major categories of "common - CD10+ B-cell
precursor" (around 50%), "pre-B" (around 25%), "T" (around 15%), "null"
(around 9%) and "B" cell ALL (around 1%). All forms other than T-ALL are
considered to be derived from some stage of B-precursor cell, and "null" ALL is
sometimes referred to as "early B-precursor" ALL.

Current risk classification schemes for ALL in children from 1-18 years
of age use clinical and laboratory parameters such as patient age, initial white
blood cell count, and the presence of specific ALL-associated cytogenetic
abnormalities to stratify patients into "low," "standard," "high," and "very high"
risk categories. National Cancer Institute (NCI) risk criteria are first applied to
all children with ALL, dividing them into "NCI standard risk" (age 1.00-9.99
years, WBC < 50,000) and "NCI high risk" (age > 10 years, WBC > 50,000)
based on age and initial white blood cell count (WBC) at disease presentation.
In addition to these general NCI risk criteria, classic cytogenetic analysis and
molecular genetic detection of frequently recurring cytogenetic abnormalities
have been used to stratify ALL patients more precisely into "low," "standard,"
"high," and "very high" risk categories. Fig. 1 shows the 4-year event free
survival (EFS) projected for each of these groups.
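
The NCI age and WBC criteria described above amount to a simple decision rule.
The following Python sketch is illustrative only. The function name
nci_risk_group, the assumption that WBC is expressed in cells per microliter,
the choice to treat every child aged one year or older who does not meet the
standard risk criteria as high risk, and the rejection of infants under one
year are assumptions made for the example, not part of the cited schemes.

```python
def nci_risk_group(age_years: float, wbc: float) -> str:
    """Assign an NCI risk group from age and initial WBC (illustrative sketch).

    Standard risk: age 1.00-9.99 years and WBC < 50,000.
    Assumption: WBC is given in cells per microliter.
    Assumption: any other child aged 1 year or older is treated as high risk.
    """
    if age_years < 1.0:
        # Assumption: infants are handled outside these criteria.
        raise ValueError("criteria apply to children aged 1 year or older")
    if age_years < 10.0 and wbc < 50_000:
        return "NCI standard risk"
    return "NCI high risk"


# Hypothetical patients (values invented for illustration).
print(nci_risk_group(4.0, 12_000))   # young child, low WBC
print(nci_risk_group(12.0, 12_000))  # older child, low WBC
print(nci_risk_group(4.0, 80_000))   # young child, high WBC
```

Cytogenetic findings and early response to induction would then refine this
first split into the "low," "standard," "high," and "very high" categories.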

These chromosomal aberrations primarily involve structural
rearrangements (translocations) or numerical imbalances (hyperdiploidy - now
assessed as specific chromosome trisomies, or hypodiploidy). Table 1 shows
recurrent ALL genetic subtypes, their frequencies and their risk categorization.

The rate of disappearance of both B precursor and T ALL leukemic cells
during induction chemotherapy (assessed morphologically or by other quantitative
measures of residual disease) has also been used as an assessment of early therapeutic
response and as a means of targeting children for therapeutic intensification (Gruhn et
al., Leukemia 12:675-681, 1998; Foroni et al., Br. J. Haematol. 105:7-24, 1999; van
Dongen et al., Lancet 352:1731-1738, 1998; Cave et al., N. Engl. J. Med. 339:591-598,
1998; Coustan-Smith et al., Lancet 351:550-554, 1998; Chessells et al., Lancet
343:143-148, 1995; Nachman et al., N. Engl. J. Med. 338:1663-1671, 1998).

Children with "low risk" disease (22% of all B precursor ALL cases) are
defined as having standard NCI risk criteria, the presence of low risk cytogenetic
abnormalities (t(12;21)/TEL-AML1 or trisomies of chromosomes 4 and 10), and a
rapid early clearance of bone marrow blasts during induction chemotherapy. Children
with "standard risk" disease (50% of ALL cases) are NCI standard risk without "low
risk" or unfavorable cytogenetic features, or are children with low risk cytogenetic
features who have NCI high risk criteria or slow clearance of blasts during induction.
Although therapeutic intensification has yielded significant improvements in outcome
in the low and standard risk groups of ALL, it is likely that a significant number of
these children are currently "over-treated" and could be cured with less intensive
regimens resulting in fewer toxicities and long term side effects. Conversely, a
significant number of children even in these good risk categories still relapse and a
precise means to prospectively identify them has remained elusive. Nearly 30% of
children with ALL have "high" or "very high" risk disease, defined by NCI high risk
criteria and the presence of specific cytogenetic abnormalities (such as t(1;19), t(9;22)
or hypodiploidy) (Table 1); again, precise measures to distinguish children more
prone to relapse in this heterogeneous group have not been established.

Despite these efforts, current diagnosis and risk classification schemes remain
imprecise. Children with ALL more prone to relapse who require more intensive
approaches and children with low risk disease who could be cured with less intensive
therapies are not adequately predicted by current classification schemes and are
distributed among all currently defined risk groups. Although pre-treatment clinical
and tumor genetic stratification of patients has generally improved outcomes by
optimizing therapy, variability in clinical course continues to exist among individuals
within a single risk group and even among those with similar prognostic features. In
fact, the most significant prognostic factors in childhood ALL explain no more than
4% of the variability in prognosis, suggesting that yet undiscovered molecular
mechanisms dictate clinical behavior (Donadieu et al., Br J Haematol, 102:729-739,
1998). A precise means to prospectively identify such children has remained elusive.



SUMMARY OF THE INVENTION

The present invention is directed to methods for outcome prediction and risk
classification in childhood leukemia. In one embodiment, the invention provides a
method for classifying leukemia in a patient that includes obtaining a biological
sample from a patient; determining the expression level for a selected gene product to
yield an observed gene expression level; and comparing the observed gene expression
level for the selected gene product to a control gene expression level. The control
gene expression level can be the expression level observed for the gene product in a
control sample, or a predetermined expression level for the gene product. An
observed expression level that differs from the control gene expression level is
indicative of a disease classification. In another aspect, the method can include
determining a gene expression profile for selected gene products in the biological
sample to yield an observed gene expression profile; and comparing the observed
gene expression profile for the selected gene products to a control gene expression
profile for the selected gene products that correlates with a disease classification;
wherein a similarity between the observed gene expression profile and the control
gene expression profile is indicative of the disease classification.

The disease classification can be, for example, a classification based on
predicted outcome (remission vs. therapeutic failure); a classification based on
karyotype; a classification based on leukemia subtype; or a classification based on
disease etiology. Where the classification is based on disease outcome, the observed
gene product is preferably the product of a gene such as OPAL1, G1, G2, FYN binding
protein, PBK1 or any of the genes listed in Table 42.

A novel gene, referred to herein as OPAL1, has been found to be strongly
predictive of outcome in childhood leukemia, and presents new opportunities for
better diagnosis, risk classification and better therapeutic options. Thus, in another
embodiment, the invention includes a polynucleotide that encodes OPAL1 and
variations thereof, the putative protein gene product of OPAL1 and variations thereof,
and an antibody that binds to OPAL1, as well as host cells and vectors that include
OPAL1.

The invention further provides for a method for predicting therapeutic
outcome in a leukemia patient that includes obtaining a biological sample from a
patient; determining the expression level for a selected gene product associated with
outcome to yield an observed gene expression level; and comparing the observed gene
expression level for the selected gene product to a control gene expression level for
the selected gene product. The control gene expression level for the selected gene
product can include the gene expression level for the selected gene product observed
in a control sample, or a predetermined gene expression level for the selected gene
product; wherein an observed expression level that is different from the control gene
expression level for the selected gene product is indicative of predicted remission.
Preferably, the selected gene product is OPAL1. Optionally, the method further
comprises determining the expression level for another gene product, such as G1 or
G2, and comparing in a similar fashion the observed gene expression level for the
second gene product with a control gene expression level for that gene product,
wherein an observed expression level for the second gene product that is different
from the control gene expression level for that gene product is further indicative of
predicted remission.

The invention further includes a method for detecting an OPAL1
polynucleotide in a biological sample which includes contacting the sample with an
OPAL1 polynucleotide, or its complement, under conditions in which the
polynucleotide selectively hybridizes to an OPAL1 gene; and detecting hybridization
of the polynucleotide to the OPAL1 gene in the sample. Likewise, the invention
provides a method for detecting the OPAL1 protein in a biological sample that
includes contacting the sample with an OPAL1 antibody under conditions in which
the antibody selectively binds to an OPAL1 protein; and detecting the binding of the
antibody to the OPAL1 protein in the sample. Pharmaceutical compositions including
a therapeutic agent that includes an OPAL1 polynucleotide, polypeptide or antibody,
together with a pharmaceutically acceptable carrier, are also included.

The invention further includes a method for treating leukemia comprising
administering to a leukemia patient a therapeutic agent that modulates the amount or
activity of the polypeptide associated with outcome. Preferably, the therapeutic agent
increases the amount or activity of OPAL1.

Also provided by the invention is an in vitro method for screening a
compound useful for treating leukemia. The invention further provides an in vivo
method for evaluating a compound for use in treating leukemia. The candidate
compounds are evaluated for their effect on the expression level(s) of one or more
gene products associated with outcome in leukemia patients. Preferably, the gene
product whose expression level is evaluated is the product of an OPAL1, G1, G2,
FYN binding protein or PBK1 gene, or any of the genes listed in Table 42. More
preferably, the gene product is a product of the OPAL1 gene.



BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color.
Copies of this patent or patent application publication with color drawings will be
provided by the Office upon request and payment of the necessary fee.

Figure 1 shows the 4 year event free survival (EFS) projected for NCI risk
categories.

Figure 2 shows the nucleotide sequences and amino acid sequences for the
coding regions of two distinct OPAL1/G0 splice forms. Fig. 2A shows nucleotide
sequence (SEQ ID NO:1) and amino acid sequence (SEQ ID NO:2) for the
OPAL1/G0 splice form incorporating exon 1; and Fig. 2B shows nucleotide sequence
(SEQ ID NO:3) and amino acid sequence (SEQ ID NO:4) for the OPAL1/G0 splice
form incorporating exon 1a. Exons 1 and 1a are highlighted by italicized bold print.
Numbers to the right indicate nucleotide and amino acid positions. Fig. 2C shows the
sequence (SEQ ID NO:16) for the full length cDNA of OPAL1. The first exon (exon
1 in this example) is underlined. The start and end positions for the exons in the
cDNA and reference sequence (GenBank accession NT_030059.11) are as follows:
exon 1, bases 1 to 171 (23284530 to 23284700), exon 2, bases 172 to 274 (23306276
to 23306378), exon 3, bases 275 to 436 (23318176 to 23318337) and exon 4, bases
437 to 4008 (23320878 to 23324547). The polyadenylation signal (position 4086 to
4091) is shown in bold and italics.

Figure 3 shows a bootstrap statistical analysis of gene list stability. 

Figure 4 is a Bayesian tree associated with outcome in ALL. 

Figure 5 is a schematic drawing of the structure of OPAL1/G0.

Figure 6 is a topographic map produced using VxInsight showing 9 novel
biologic clusters of ALL (2 distinct T ALL clusters (S1 and S2) and 7 distinct B
precursor ALL clusters (A, B, C, X, Y, Z)) each with distinguishing gene expression
profiles.

Figure 7 shows a gene list comparison. Principal Component Analysis (PCA)
and the VxInsight clustering program (ANOVA) were employed to identify genes that
determined T-cell leukemia cases. The gene lists are compared with those derived
from the different feature selection methods used by Yeoh et al. (Cancer Cell,
1:133-143, 2002) for T-cell classification. The yellow color represents overlap between
the lists derived by PCA and the T-ALL characterizing gene lists; the cyan represents
overlap between the ANOVA and the T-ALL characterizing gene lists. The green
pattern represents genes that are shared by all the lists.

Figure 8 shows a gene list comparison. Bayesian Networks were employed to
identify genes that determined the gene expression patterns across the different
translocations. The gene lists were compared with those derived using chi square
analysis by Yeoh et al. (Cancer Cell, 1:133-143, 2002) for ALL classification. The
colored cells represent overlap between the lists derived by Bayesian nets and the
ALL characterizing gene lists from Yeoh et al. (Cancer Cell, 1:133-143, 2002).

Figure 9 shows Principal Component Analysis of the infant gene expression
data. Principal Component Analysis (PCA) projections are used to compare the
ALL/AML partition, the MLL/Non-MLL partition, and the VxInsight partition of the
infant gene expression data. The three by three grid of plots in this figure allows this
comparison by using the same PCA projections with different colors for the different
partitions. Each row of the grid shows a different partition and each column shows a
different PCA projection. The ALL/AML partition is shown in the first row of the
figure using light purple for ALL and dark purple for AML. The three plots in this
row give two-dimensional projections of the data onto the first three principal
components. Since there are three such projections there are three plots (from left to
right): PC 1 vs. PC 2, PC 2 vs. PC 3, and PC 1 vs. PC 3. This scheme is repeated for
the remaining two partitions. Specifically, the MLL/Non-MLL partition is shown
using orange and dark green in the second row, and the VxInsight partition is shown
using red, green, and blue in the last row. This grid enables both visualization of the
data (by examining the rows) and comparison of the partitions (by examining the
columns).

Figure 10 shows results of the graphic directed algorithm applied to the infant
dataset. The VxInsight program constructs a mountain terrain over the clusters such
that the height of each mountain represents the number of elements in the cluster
under the mountain. Top left: this force-directed clustering algorithm partitions the
infant data into three clusters labeled A, B, and C. Top right: VxInsight terrain map
showing the distribution of the leukemia types across the clusters. ALL cases are
shown in white and AML are shown in green. Bottom left: VxInsight terrain map
showing the distribution of MLL cases (shown in blue) across the clusters.

Figure 11 shows hierarchical clustering of the 126 infant leukemia samples
using the "cluster-characterizing" gene sets. The rows represent genes that distinguish
between the VxInsight clusters from Figure 2 (n=150). Genes were selected by
ANOVA as the top 0.1% of genes discriminating between each one of the clusters and
the rest of the cases. Each gene is normalized across all 126 cases and the relative
expression is depicted in the heat map by color, as shown in the expression scale in
the bottom of the figure. The patient-to-patient distance was computed using Pearson's
correlation coefficient in the Genespring program (Silicon Genetics). The columns in
the dendrogram represent patients as clustered by their gene expression. The
correlation between these three resultant clusters and the VxInsight clusters is higher
than 90%.
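
A patient-to-patient distance based on Pearson's correlation coefficient can be
sketched as follows. This is a minimal illustration in plain Python and does
not reproduce the Genespring implementation; the function names and the
convention that distance equals one minus the correlation are assumptions, and
the expression values are invented.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


def correlation_distance(x, y):
    # Assumption: distance = 1 - r, so identical patterns give 0 and
    # perfectly opposite patterns give 2.
    return 1.0 - pearson(x, y)


# Hypothetical normalized expression of the same four genes in three patients.
patient_a = [0.5, 1.0, 2.0, 4.0]
patient_b = [0.6, 1.1, 1.9, 3.8]
patient_c = [4.0, 2.0, 1.0, 0.5]

print(correlation_distance(patient_a, patient_b))  # small: similar patterns
print(correlation_distance(patient_a, patient_c))  # large: opposite patterns
```

A hierarchical clustering routine would take the matrix of such pairwise
distances as its input.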

Figure 12 shows gene expression for various hematopoietic stem cell antigens
in the infant leukemia data set. Fig. 12A is a gene expression "heat map" of selected
HOX genes and hematopoietic stem cell antigens. The columns represent genes, while
the rows represent patients organized by their VxInsight cluster membership A, B or
C (see Fig. 10). The gene expression signals of 31 genes from the 126 leukemia
patients were normalized relative to the median signal for each gene. The color
characterizes the expression relative to the median. Red represents expression
greater than the median, black is equal to the median and green is less than the
median. Fig. 12B shows HOX genes median expression across the VxInsight clusters
of the infant leukemia data set. The red, blue and black bars represent the median of
expression of each HOX family gene across all the cases in VxInsight clusters A, B
and C, respectively.
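
The median normalization and color coding described for Fig. 12A can be
sketched as follows. The sketch assumes that "normalized relative to the
median" means dividing each signal by the per-gene median; the function names
and the signal values are hypothetical.

```python
from statistics import median


def normalize_to_median(signals):
    """Divide each patient's signal for one gene by that gene's median signal."""
    m = median(signals)
    if m == 0:
        raise ValueError("median signal is zero; ratio is undefined")
    return [s / m for s in signals]


def heat_map_color(ratio):
    """Map a median-normalized value to the color scheme of the heat map."""
    if ratio > 1.0:
        return "red"    # expression greater than the median
    if ratio < 1.0:
        return "green"  # expression less than the median
    return "black"      # expression equal to the median


# Hypothetical signals for one gene across three patients.
ratios = normalize_to_median([2.0, 4.0, 8.0])
print(ratios)
print([heat_map_color(r) for r in ratios])
```

In practice a heat map uses a continuous color scale; the three-way split here
only mirrors the red, black and green description in the text.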

Figure 13 shows a VxInsight patient map showing the distribution of MLL
cases across the clusters derived from gene expression similarities. Top left:
Magnification of the cluster A (15 ALL/5 AML cases), characterized by a "stem
cell-like" gene expression pattern. Top right: cluster B, mainly ALL (51 ALL/1 AML
cases). Bottom left: cluster C, mainly AML (12 ALL/42 AML cases).

Figure 14 shows Affymetrix gene expression signal for the FMS-related
tyrosine kinase 3 (FLT3) gene across the different MLL translocations. The error bar
represents the standard error of the mean. Other MLL translocations include t(7;11),
t(X;11) and t(11;11).

Figure 15 shows genes that characterize the t(4;11) translocation in A vs. B,
derived from the VxInsight clustering program using ANOVA. The red color
represents genes that have higher expression in the t(4;11) cases in VxInsight cluster
A against the t(4;11) cases in VxInsight cluster B.

Figure 16 shows genes that characterize each one of the MLL translocations
(derived from Bayesian Networks Analysis). The highlighted genes represent
possible therapeutic targets.

Figure 17 shows genes that characterize the t(4;11) translocation and the
MLL translocations, derived from Bayesian Networks Analysis, Support Vector
Machines (SVM), Fuzzy Logic and Discriminant Analysis.

Figure 18 shows genes that characterize the t(4;11) translocation (left column)
and the MLL translocations (right column), derived from the VxInsight clustering
program using ANOVA. The red color represents genes that have higher expression
in the t(4;11) cases against the rest of the cases or the MLL cases against the rest.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS 

Gene expression profiling can provide insights into disease etiology and
genetic progression, and can also provide tools for more comprehensive molecular
diagnosis and therapeutic targeting. The biologic clusters and associated gene profiles
identified herein are useful for refined molecular classification of acute leukemias as
well as improved risk assessment and classification. In addition, the invention has
identified numerous genes, including but not limited to the novel gene OPAL1 (also
referred to herein as "G0"), G protein β2, related sequence 1 (also referred to herein as
"G1"), IL-10 Receptor alpha (also referred to herein as "G2"), FYN-binding protein
and PBK1, and the genes listed in Table 42 that are, alone or in combination, strongly
predictive of outcome in pediatric ALL. The genes identified herein, and the proteins
they encode, can be used to refine risk classification and diagnostics, to make
outcome predictions and improve prognostics, and to serve as therapeutic targets in
infant leukemia and pediatric ALL.

"Gene expression" as the term is used herein refers to the production of a
biological product encoded by a nucleic acid sequence, such as a gene sequence. This
biological product, referred to herein as a "gene product," may be a nucleic acid or a
polypeptide. The nucleic acid is typically an RNA molecule which is produced as a
transcript from the gene sequence. The RNA molecule can be any type of RNA
molecule, whether before (e.g., precursor RNA) or after (e.g., mRNA)
post-transcriptional processing. cDNA prepared from the mRNA of a sample is also
considered a gene product. The polypeptide gene product is a peptide or protein that
is encoded by the coding region of the gene, and is produced during the process of
translation of the mRNA.

The term "gene expression level" refers to a measure of a gene product(s) of
the gene and typically refers to the relative or absolute amount or activity of the gene
product.

The term "gene expression profile" as used herein is defined as the expression
level of two or more genes. Typically a gene expression profile includes expression
levels for the products of multiple genes in a given sample, up to 13,000 in the
experiments described herein, preferably determined using an oligonucleotide
microarray.

Unless otherwise specified, "a," "an," "the," and "at least one" are used 
interchangeably and mean one or more than one. 

Diagnosis, Prognosis and Risk Classification

Current parameters used for diagnosis, prognosis and risk classification in
pediatric ALL are related to clinical data, cytogenetics and response to treatment.
They include age and white blood count, cytogenetics, the presence or absence of
minimal residual disease (MRD), and a morphological assessment of early response
(measured as slow or rapid early therapeutic response). As noted above, however,
these parameters are not always well correlated with outcome, nor are they precisely
predictive at diagnosis.

The present invention provides an improved method for identifying and/or
classifying acute leukemias. Expression levels are determined for one or more genes
associated with outcome, risk assessment or classification, karyotype (e.g., MLL
translocation) or subtype (e.g., ALL vs. AML; pre-B ALL vs. T-ALL). Genes that are
particularly relevant for diagnosis, prognosis and risk classification according to the
invention include those described in the tables and figures herein. The gene
expression levels for the gene(s) of interest in a biological sample from a patient
diagnosed with or suspected of having an acute leukemia are compared to gene
expression levels observed for a control sample, or with a predetermined gene
expression level. Observed expression levels that are higher or lower than the
expression levels observed for the gene(s) of interest in the control sample or that are
higher or lower than the predetermined expression levels for the gene(s) of interest
provide information about the acute leukemia that facilitates diagnosis, prognosis,
and/or risk classification and can aid in treatment decisions. When the expression
levels of multiple genes are assessed for a single biological sample, a gene expression
profile is produced.

In one aspect, the invention provides genes and gene expression profiles that
are correlated with outcome (i.e., complete continuous remission vs. therapeutic
failure) in infant leukemia and/or in pediatric ALL. Assessment of one or more of
these genes according to the invention can be integrated into revised risk classification
schemes, therapeutic targeting and clinical trial design. In one embodiment, the
expression levels of a particular gene are measured, and that measurement is used,
either alone or with other parameters, to assign the patient to a particular risk
category. The invention identifies several genes whose expression levels, either alone
or in combination, are associated with outcome, including but not limited to
OPAL1/G0, G1, G2, PBK1 (Affymetrix accession no. 39418_at, DKFZP564M182
protein; GenBank No. AJ007398); FYN-binding protein (Affymetrix accession no.
41819_at, FYB-120/130; GenBank No. AF001862; da Silva, Proc. Nat'l. Acad. Sci.
USA 94(14):7493-7498 (1997)); and the genes listed in Table 42. Some of these
genes (e.g., OPAL1/G0) exhibit a positive association between expression level and
outcome. For these genes, expression levels above a predetermined threshold level
(or higher than that exhibited by a control sample) are predictive of a positive outcome.
Our data suggest that direct measurement of the expression level of OPAL1/G0,
optionally in conjunction with G1 and/or G2, can be used in refining risk
classification and outcome prediction in pediatric ALL. In particular, it is expected
such measurements can be used to refine risk classification in children who are
otherwise classified as having low risk ALL, as well as to precisely identify children
with high risk ALL who could be cured with less intensive therapies.

OPAL1/G0, in particular, is a very strong predictor for outcome. Our data
suggest that OPAL1/G0 (alone and/or together with G1 and/or G2) may prove to be
the dominant predictor for outcome in infant leukemia or pediatric ALL, more
powerful than the current risk stratification standards of age and white blood count.
OPAL1/G0 tends to be expressed at lower frequencies and lower overall levels in
ALL cases with cytogenetic abnormalities associated with a poorer prognosis (such as
t(9;22) and t(4;11)). Indeed, regardless of risk classification, cytogenetics or
biological group, roughly the same outcome statistics are seen based upon the
expression level of OPAL1/G0.

We found that higher OPAL1 expression distinguished ALL cases with good
(OPAL1 high: 87% long term remission) versus poor outcome (OPAL1 low: 32%
long term remission) in a statistically designed, retrospective pediatric ALL case
control study (detailed below). Low OPAL1 was associated with induction failure
(p=.0036) while high OPAL1 was associated with long term event free survival
(p=.02), particularly in males (p=.0004). OPAL1 was more frequently expressed at
higher levels in cases with t(12;21), normal karyotype, and hyperdiploidy (better
prognosis karyotypes) compared to t(1;19) or t(9;22) (poorer prognosis karyotypes).
86% of ALL cases with t(12;21) and high OPAL1 achieved long term remission in
contrast to only 35% of t(12;21) cases with low OPAL1, suggesting that OPAL1 may
be useful in prospectively identifying children who might benefit from further
intensification. In ALL cases classified as high risk by the NCI criteria, 87% of those
that exhibited high OPAL1 levels actually achieved long term remission, compared to
an overall long term remission outcome of 44% in this cohort. OPAL1 was also highly
predictive of a favorable outcome in T ALL (p=.02) and a similar trend was observed
in a distinct infant ALL data set (see below). Thus, high OPAL1 levels are expected
to be associated with long term remissions on standard, less intensive therapies, and
conversely low OPAL1 levels, even in otherwise low risk ALL patients defined by
current risk classification schemes, can identify children who require therapeutic
intensification for cure.

For genes such as PBK1 whose expression levels are inversely correlated with
outcome, observed expression levels above a predetermined threshold level (or higher
than those observed in a control sample) are useful for classifying a patient into a
higher risk category due to the predicted unfavorable outcome. Expression levels for
multiple genes can be measured. For example, if normalized expression levels for
OPAL1/GO, G1 and G2 are all high, a favorable outcome can be predicted with
greater certainty.
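
By way of illustration only, the threshold comparison described above might be sketched as follows. The numeric thresholds, the function name classify_risk, and the rule for combining genes are assumptions made for this example; the specification does not fix any of them.

```python
from typing import Dict

# Hypothetical thresholds on normalized expression (assumed values;
# no numeric cutoffs are given in the text).
FAVORABLE_WHEN_HIGH = {"OPAL1": 1.0, "G1": 1.0, "G2": 1.0}
UNFAVORABLE_WHEN_HIGH = {"PBK1": 1.0}


def classify_risk(expr: Dict[str, float]) -> str:
    """Return 'higher', 'lower' or 'indeterminate' for one patient.

    Assumed combination rule: any measured unfavorable gene above its
    threshold gives 'higher'; otherwise, all measured favorable genes
    above threshold gives 'lower'.
    """
    favorable = [expr[g] > t for g, t in FAVORABLE_WHEN_HIGH.items() if g in expr]
    unfavorable = [expr[g] > t for g, t in UNFAVORABLE_WHEN_HIGH.items() if g in expr]
    if any(unfavorable):
        return "higher"
    if favorable and all(favorable):
        return "lower"
    return "indeterminate"
```

In this sketch a profile with OPAL1, G1 and G2 all above threshold is placed in the lower risk category, while a profile with elevated PBK1 is placed in the higher risk category regardless of the other genes.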

The expression levels of multiple (two or more) genes in one or more lists of
genes associated with outcome can be measured, and those measurements are used,
either alone or with other parameters, to assign the patient to a particular risk
category. For example, gene expression levels of multiple genes can be measured for
a patient (as by evaluating gene expression using an Affymetrix microarray chip) and
compared to a list of genes whose expression levels (high or low) are associated with
a positive (or negative) outcome. If the gene expression profile of the patient is
similar to that of the list of genes associated with outcome, then the patient can be
assigned to a low (or high, as the case may be) risk category. The correlation between
gene expression profiles and class distinction can be determined using a variety of
methods. Methods of defining classes and classifying samples are described, for
example, in Golub et al., U.S. Patent Application Publication No. 2003/0017481,
published January 23, 2003, and Golub et al., U.S. Patent Application Publication No.
2003/0134300, published July 17, 2003. The information provided by the present
invention, alone or in conjunction with other test results, aids in sample classification
and diagnosis of disease.
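
One simple way to compare a patient profile with an outcome-associated gene list is a Pearson correlation over the shared genes, sketched below. This is only one of the "variety of methods" referred to above and is not the method of the cited Golub et al. publications; the function names, the 0.5 cutoff and the encoding of the reference pattern are assumptions made for this example.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def assign_risk(profile: Dict[str, float],
                favorable_pattern: Dict[str, float],
                cutoff: float = 0.5) -> str:
    """Compare a patient profile to a favorable-outcome reference pattern.

    Both arguments map gene names to normalized expression values.
    The cutoff is a hypothetical value chosen for illustration.
    """
    genes = sorted(set(profile) & set(favorable_pattern))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    r = pearson([profile[g] for g in genes],
                [favorable_pattern[g] for g in genes])
    if r >= cutoff:
        return "low"
    if r <= -cutoff:
        return "high"
    return "unassigned"
```

A profile that tracks the favorable pattern is assigned to the low risk category, one that is anticorrelated with it is assigned to the high risk category, and intermediate cases are left unassigned for evaluation with other parameters.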

Computational analysis using the gene lists and other data, such as measures
of statistical significance, as described herein is readily performed on a computer.
The invention should therefore be understood to encompass machine readable media
comprising any of the data, including gene lists, described herein. The invention
further includes an apparatus that includes a computer comprising such data and an
output device such as a monitor or printer for evaluating the results of computational
analysis performed using such data.

In another aspect, the invention provides genes and gene expression profiles
that are correlated with cytogenetics. This allows discrimination among the various
karyotypes, such as MLL translocations or numerical imbalances such as
hyperdiploidy or hypodiploidy, which are useful in risk assessment and outcome
prediction.

In yet another aspect, the invention provides genes and gene expression
profiles that are correlated with intrinsic disease biology and/or etiology. In other
words, gene expression profiles that are common or shared among individual
leukemia cases in different patients can be used to define intrinsically related groups
(often referred to as clusters) of acute leukemia that cannot be appreciated or
diagnosed using standard means such as morphology, immunophenotype, or
cytogenetics. Mathematical modeling of the very sharp peak in ALL incidence seen
in children 2-3 years old (>80 cases per million) has suggested that ALL may arise
from two primary events, the first of which occurs in utero and the second after birth
(Linet et al., Descriptive epidemiology of the leukemias, in Leukemias, 5th Edition,
ES Henderson et al. (eds), WB Saunders, Philadelphia, 1990). Interestingly, the
detection of certain ALL-associated genetic abnormalities in cord blood samples
taken at birth from children who are ultimately affected by disease supports this
hypothesis (Gale et al., Proc. Natl. Acad. Sci. U.S.A., 94:13950-13954, 1997; Ford et
al., Proc. Natl. Acad. Sci. U.S.A., 95:4584-4588, 1998).

Our results for both infant leukemia and pediatric ALL suggest that this
disease is composed of novel intrinsic biologic clusters defined by shared gene
expression profiles, and that these intrinsic subsets cannot be defined or predicted by
traditional labels currently used for risk classification or by the presence or absence of
specific cytogenetic abnormalities. We have identified 9 novel groups for pediatric
ALL and 3 novel groups for infant leukemia using unsupervised learning methods for
class discovery, and have used supervised learning methods for class prediction and
outcome correlations that have identified candidate genes associated with
classification and outcome. The gene expression profiles in the infant leukemia
clusters provide some clues to novel and independent etiologies.

Some genes in these clusters are metabolically related, suggesting a
metabolic pathway that is associated with cancer initiation or progression. Other
genes in these metabolic pathways, like the genes described herein but upstream or
downstream from them in the metabolic pathway, thus can also serve as therapeutic
targets.

In yet another aspect, the invention provides genes and gene expression
profiles that discriminate acute myeloid leukemia (AML) from acute lymphoblastic
leukemia (ALL) in infant leukemias by measuring the expression levels of a gene
product correlated with ALL or AML.

Another aspect of the invention provides genes and gene expression profiles
that discriminate pre-B lineage ALL from T ALL in pediatric leukemias by measuring 
expression levels of a gene product correlated with pre-B lineage ALL or T ALL. 

It should be appreciated that while the present invention is described primarily 
in terms of human disease, it is useful for diagnostic and prognostic applications in 
other mammals as well, particularly in veterinary applications such as those related to 
the treatment of acute leukemia in cats, dogs, cows, pigs, horses and rabbits. 

Further, the invention provides computational and statistical
methods for identifying genes, lists of genes and gene expression profiles associated
with outcome, karyotype, disease subtype and the like as described herein.

Measurement of gene expression levels 

Gene expression levels are determined by measuring the amount or activity of 
a desired gene product (i.e., an RNA or a polypeptide encoded by the coding sequence 
of the gene) in a biological sample. Any biological sample can be analyzed. 
Preferably the biological sample is a bodily tissue or fluid, more preferably it is a 
bodily fluid such as blood, serum, plasma, urine, bone marrow, lymphatic fluid, and 
CNS or spinal fluid. Preferably, samples containing mononuclear blood cells and/or
bone marrow fluids and tissues are used. In embodiments of the method of the
invention practiced in cell culture (such as methods for screening compounds to
identify therapeutic agents), the biological sample can be whole or lysed cells from 
the cell culture or the cell supernatant. 

Gene expression levels can be assayed qualitatively or quantitatively. The 
level of a gene product is measured or estimated in a sample either directly (e.g., by 
determining or estimating absolute level of the gene product) or relatively (e.g., by 
comparing the observed expression level to a gene expression level of another
sample or set of samples). Measurements of gene expression levels may, but need
not, include a normalization process. 
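
As a hedged illustration of a relative measurement, the sketch below expresses an observed level as a log2 ratio against the median of a reference set of samples. The choice of the median and of a log2 scale, and the function name, are assumptions made for this example; other normalization schemes are equally consistent with the description above.

```python
import math
from statistics import median
from typing import Sequence


def log2_relative_expression(observed: float,
                             reference_levels: Sequence[float]) -> float:
    """Log2 ratio of an observed level to the median of a reference set.

    Positive values mean the sample is above the reference median,
    negative values mean it is below. All levels must be positive.
    """
    if not reference_levels:
        raise ValueError("reference set is empty")
    if observed <= 0 or any(r <= 0 for r in reference_levels):
        raise ValueError("expression levels must be positive")
    return math.log2(observed / median(reference_levels))
```

For instance, an observed level of 8.0 against reference levels of 1.0, 2.0 and 4.0 (median 2.0) gives a ratio of 4, that is, a log2 relative expression of 2.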

Typically, mRNA levels (or cDNA prepared from such mRNA) are assayed to
determine gene expression levels. Methods to detect gene expression levels include
Northern blot analysis (e.g., Harada et al., Cell 63:303-312 (1990)), S1 nuclease
mapping (e.g., Fujita et al., Cell 49:357-367 (1987)), polymerase chain reaction
(PCR), reverse transcription in combination with the polymerase chain reaction
(RT-PCR) (e.g., Example III; see also Makino et al., Technique 2:295-301 (1990)), and
reverse transcription in combination with the ligase chain reaction (RT-LCR).
Multiplexed methods that allow the measurement of expression levels for many genes
simultaneously are preferred, particularly in embodiments involving methods based
on gene expression profiles comprising multiple genes. In a preferred embodiment,
gene expression is measured using an oligonucleotide microarray, such as a DNA
microchip, as described in the examples below. DNA microchips contain
oligonucleotide probes affixed to a solid substrate, and are useful for screening a large
number of samples for gene expression.

Alternatively or in addition, polypeptide levels can be assayed.
Immunological techniques that involve antibody binding, such as enzyme linked
immunosorbent assay (ELISA) and radioimmunoassay (RIA), are typically employed.
Where activity assays are available, the activity of a polypeptide of interest can be
assayed directly.

The observed expression levels for the gene(s) of interest are evaluated to
determine whether they provide diagnostic or prognostic information for the leukemia
being analyzed. The evaluation typically involves a comparison between observed
gene expression levels and either a predetermined gene expression level or threshold
value, or a gene expression level that characterizes a control sample. The control
sample can be a sample obtained from a normal (i.e., non-leukemic) patient or it can
be a sample obtained from a patient with a known leukemia. For example, if a
cytogenetic classification is desired, the biological sample can be interrogated for the
expression level of a gene correlated with the cytogenetic abnormality, then compared
with the expression level of the same gene in a patient known to have the cytogenetic
abnormality (or an average expression level for the gene that characterizes that
population).

Treatment of infant leukemia and pediatric ALL 

The genes identified herein that are associated with outcome and/or specific
disease subtypes or karyotypes are likely to have a specific role in the disease
condition, and hence represent novel therapeutic targets. Thus, another aspect of the
invention involves treating infant leukemia and pediatric ALL patients by modulating
the expression of one or more genes described herein.

In the case of OPAL1/GO, whose increased expression above threshold values
is associated with a positive outcome, the treatment method of the invention involves
enhancing OPAL1/GO expression. For a number of the gene products identified
herein, increased expression is correlated with positive outcomes in leukemia patients.
Thus, the invention includes a method for treating leukemia, such as infant leukemia
and/or pediatric ALL, that involves administering to a patient a therapeutic agent that
causes an increase in the amount or activity of OPAL1/GO and/or other polypeptides
of interest that have been identified herein to be positively correlated with outcome.
Preferably the increase in amount or activity of the selected gene product is at least
10%, preferably 25%, most preferably 100% above the expression level observed in
the patient prior to treatment.
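
The percentage criteria above can be checked with simple arithmetic, sketched here. The function name and the tier labels are assumptions made for this example; only the 10%, 25% and 100% figures come from the text.

```python
def increase_tier(before: float, after: float) -> str:
    """Classify the percent increase of a gene product over its pre-treatment level."""
    if before <= 0:
        raise ValueError("pre-treatment level must be positive")
    percent_increase = 100.0 * (after - before) / before
    if percent_increase >= 100.0:
        return "most preferred"   # at least 100% above pre-treatment level
    if percent_increase >= 25.0:
        return "preferred"        # at least 25% above
    if percent_increase >= 10.0:
        return "minimum"          # at least 10% above
    return "below minimum"
```

Thus a rise from 100 to 125 units falls in the preferred range, and a doubling from 10 to 20 units falls in the most preferred range.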

The therapeutic agent can be a polypeptide having the biological activity of
the polypeptide of interest (e.g., an OPAL1/GO polypeptide) or a biologically active
subunit or analog thereof. Alternatively, the therapeutic agent can be a ligand (e.g., a
small non-peptide molecule, a peptide, a peptidomimetic compound, an antibody, or
the like) that agonizes (i.e., increases) the activity of the polypeptide of interest. For
example, in the case of OPAL1/GO, which is postulated to be a membrane-bound
protein that may function as a receptor or signaling molecule, the invention
encompasses the use of a proline-rich ligand of the WW-binding protein 1 to agonize
OPAL1/GO activity.

Gene therapies can also be used to increase the amount of a polypeptide of
interest, such as OPAL1/GO, in a host cell of a patient. Polynucleotides operably
encoding the polypeptide of interest can be delivered to a patient either as "naked
DNA" or as part of an expression vector. The term vector includes, but is not limited
to, plasmid vectors, cosmid vectors, artificial chromosome vectors, or, in some
aspects of the invention, viral vectors. Examples of viral vectors include adenovirus,
herpes simplex virus (HSV), alphavirus, simian virus 40, picornavirus, vaccinia virus,
retrovirus, lentivirus, and adeno-associated virus. Preferably the vector is a plasmid.
In some aspects of the invention, a vector is capable of replication in the cell to which
it is introduced; in other aspects the vector is not capable of replication. In some
preferred aspects of the present invention, the vector is unable to mediate the
integration of the vector sequences into the genomic DNA of a cell. An example of a
vector that can mediate the integration of the vector sequences into the genomic DNA
of a cell is a retroviral vector, in which the integrase mediates integration of the
retroviral vector sequences. A vector may also contain transposon sequences that
facilitate integration of the coding region into the genomic DNA of a host cell.

Selection of a vector depends upon a variety of desired characteristics in the
resulting construct, such as a selection marker, vector replication rate, and the like.
An expression vector optionally includes expression control sequences operably
linked to the coding sequence such that the coding region is expressed in the cell. The
invention is not limited by the use of any particular promoter, and a wide variety is
known. Promoters act as regulatory signals that bind RNA polymerase in a cell to
initiate transcription of a downstream (3' direction) operably linked coding sequence.
The promoter used in the invention can be a constitutive or an inducible promoter. It
can be, but need not be, heterologous with respect to the cell to which it is introduced.

Another option for increasing the expression of a gene like OPAL1/GO
wherein higher expression levels are predictive for outcome is to reduce the amount
of methylation of the gene. Demethylation agents, therefore, can be used to
re-activate expression of OPAL1/GO in cases where methylation of the gene is responsible
for reduced gene expression in the patient.

For other genes identified herein as being correlated with outcome in infant
leukemia or pediatric ALL, high expression of the gene is associated with a negative
outcome rather than a positive outcome. An example of this type of gene is PBK1.
These genes (and their associated gene products) accordingly represent novel
therapeutic targets, and the invention provides a therapeutic method for reducing the
amount and/or activity of these polypeptides of interest in a leukemia patient.
Preferably the amount or activity of the selected gene product is reduced to at least
90%, more preferably at least 75%, most preferably at least 25% of the gene
expression level observed in the patient prior to treatment.

A cell manufactures proteins by first transcribing the DNA of a gene for that
protein to produce RNA (transcription). In eukaryotes, this transcript is an
unprocessed RNA called precursor RNA that is subsequently processed (e.g., by the
removal of introns, splicing, and the like) into messenger RNA (mRNA) and finally
translated by ribosomes into the desired protein. This process may be interfered with
or inhibited at any point, for example, during transcription, during RNA processing,
or during translation. Reduced expression of the gene(s) leads to a decrease or
reduction in the activity of the gene product.

The therapeutic method for inhibiting the activity of a gene whose expression
is correlated with negative outcome involves the administration of a therapeutic agent
to the patient. The therapeutic agent can be a nucleic acid, such as an antisense RNA
or DNA, or a catalytic nucleic acid such as a ribozyme, that reduces activity of the
gene product of interest by directly binding to a portion of the gene encoding the
enzyme (for example, at the coding region, at a regulatory element, or the like) or an
RNA transcript of the gene (for example, a precursor RNA or mRNA, at the coding
region or at 5' or 3' untranslated regions) (see, e.g., Golub et al., U.S. Patent
Application Publication No. 2003/0134300, published July 17, 2003). Alternatively,
the nucleic acid therapeutic agent can encode a transcript that binds to an endogenous
RNA or DNA; or encode an inhibitor of the activity of the polypeptide of interest. It
is sufficient that the introduction of the nucleic acid into the cell of the patient is or
can be accompanied by a reduction in the amount and/or the activity of the
polypeptide of interest. An RNA aptamer can also be used to inhibit gene expression.
The therapeutic agent may also be a protein inhibitor or antagonist, such as a small
non-peptide molecule (e.g., a drug or a prodrug), a peptide, a peptidomimetic compound,
an antibody, a protein or fusion protein, or the like that acts directly on the
polypeptide of interest to reduce its activity.

The invention includes a pharmaceutical composition that includes an
effective amount of a therapeutic agent as described herein as well as a
pharmaceutically acceptable carrier. Therapeutic agents can be administered in any
convenient manner including parenteral, subcutaneous, intravenous, intramuscular,
intraperitoneal, intranasal, inhalation, transdermal, oral or buccal routes. The dosage
administered will be dependent upon the nature of the agent; the age, health, and
weight of the recipient; the kind of concurrent treatment, if any; frequency of
treatment; and the effect desired. A therapeutic agent identified herein can be
administered in combination with any other therapeutic agent(s) such as
immunosuppressives, cytotoxic factors and/or cytokines to augment therapy. See
Golub et al., U.S. Patent Application Publication No. 2003/0134300, published
July 17, 2003, for examples of suitable pharmaceutical formulations and methods,
suitable dosages, treatment combinations and representative delivery vehicles.

The effect of a treatment regimen on an acute leukemia patient can be assessed
by evaluating, before, during and/or after the treatment, the expression level of one or
more genes as described herein. Preferably, the expression level of gene(s) associated
with outcome, such as OPAL1/GO, G1 and/or G2, is monitored over the course of the
treatment period. Optionally, gene expression profiles showing the expression levels
of multiple selected genes associated with outcome can be produced at different times
during the course of treatment and compared to each other and/or to an expression
profile correlated with outcome.

Screening for therapeutic agents 
The invention further provides methods for screening to identify agents that
modulate expression levels of the genes identified herein that are correlated with
outcome, risk assessment or classification, cytogenetics or the like. Candidate
compounds can be identified by screening chemical libraries according to methods
well known to the art of drug discovery and development (see Golub et al., U.S.
Patent Application Publication No. 2003/0134300, published July 17, 2003, for a
detailed description of a wide variety of screening methods). The screening method
of the invention is preferably carried out in cell culture, for example using leukemic
cell lines that express known levels of the therapeutic target, such as OPAL1/GO. The
cells are contacted with the candidate compound and changes in gene expression of
one or more genes relative to a control culture are measured. Alternatively, gene
expression levels before and after contact with the candidate compound can be
measured. Changes in gene expression indicate that the compound may have
therapeutic utility. Structural libraries can be surveyed computationally after
identification of a lead drug to achieve rational drug design of even more effective
compounds.
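
The comparison of a treated culture with a control culture might be sketched as follows. The two-fold cutoff (an absolute log2 fold change of 1.0), the function name and the example values are assumptions made for this illustration; the specification does not state a numeric criterion for a change in expression.

```python
import math
from typing import Dict


def flag_expression_changes(treated: Dict[str, float],
                            control: Dict[str, float],
                            min_abs_log2_fc: float = 1.0) -> Dict[str, float]:
    """Return log2 fold changes (treated vs. control) that pass the cutoff.

    Only genes measured in both cultures are compared; all levels must
    be positive. The default cutoff of 1.0 is a hypothetical value.
    """
    hits = {}
    for gene in sorted(set(treated) & set(control)):
        if treated[gene] <= 0 or control[gene] <= 0:
            raise ValueError(f"non-positive expression level for {gene}")
        log2_fc = math.log2(treated[gene] / control[gene])
        if abs(log2_fc) >= min_abs_log2_fc:
            hits[gene] = log2_fc
    return hits


# Hypothetical measurements for one candidate compound.
example_hits = flag_expression_changes(
    {"OPAL1": 8.0, "PBK1": 1.0, "G1": 3.0},
    {"OPAL1": 2.0, "PBK1": 4.0, "G1": 3.0},
)
```

In this made-up example the candidate compound raises OPAL1 four-fold and lowers PBK1 four-fold, so both genes are flagged, while the unchanged G1 is not.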

The invention further relates to compounds thus identified according to the 
screening methods of the invention. Such compounds can be used to treat infant 
leukemia and/or pediatric ALL, as appropriate, and can be formulated for therapeutic 
use as described above. 

OPALl polynucleotide, polypeptide and antibody 

The invention includes novel nucleotide sequences found to be strongly
associated with outcome in pediatric ALL, as well as the novel polypeptides they
encode. These sequences, which we originally called "GO" but now have named
OPAL1 for Outcome Predictor in Acute Leukemia, appear to be associated with
alternatively spliced products of a large and complex gene. Alternate 5' exon usage
likely causes the production of more than one distinct protein from the genomic
sequence. We have now fully cloned both the genomic and cDNA sequences (SEQ
ID NO:16) of OPAL1. Expression levels of OPAL1/GO that are high in relation to a
predetermined threshold or a control sample are indicative of good prognosis.

Nucleotide sequences (SEQ ID NOs:1 and 3) encoding two alternatively
spliced forms of the polypeptide gene product, OPAL1/GO, are shown in Fig. 2. The
putative amino acid sequences (SEQ ID NOs:2 and 4) of the two forms of protein
OPAL1/GO are also shown in Fig. 2. Analysis of the protein sequence suggests that
OPAL1/GO may be a transmembrane protein with a short (53 amino acid)
extracellular domain and an intracellular domain. Both the short extracellular and
longer intracellular domains have proline-rich regions that are homologous to proteins
that bind WW domains such as the WBP-1 Domain-Binding Protein 1 located at
human chromosome 2p12 (MIM #60691; WBP1 in HUGO; UniGene Hs.7709). Like
SH3 domains in proteins, WW domains interact with proline-rich transcription factors
and cytoplasmic signaling molecules (such as OPAL1/GO) to mediate protein-protein
interactions regulating gene expression and cell signaling. The data suggest that this
novel coding sequence encodes a signaling protein having a WW-binding domain and
that it likely plays an important role in regulation of these cellular processes.

The present invention also includes polypeptides with an amino acid sequence
having at least about 80% amino acid identity, at least about 90% amino acid identity,
or about 95% amino acid identity with SEQ ID NO:2 or 4. Amino acid identity is
defined in the context of a comparison between an amino acid sequence and SEQ ID
NO:2 or 4, and is determined by aligning the residues of the two amino acid
sequences (i.e., a candidate amino acid sequence and the amino acid sequence of SEQ
ID NO:2 or 4) to optimize the number of identical amino acids along the lengths of
their sequences; gaps in either or both sequences are permitted in making the
alignment in order to optimize the number of identical amino acids, although the
amino acids in each sequence must nonetheless remain in their proper order. A
candidate amino acid sequence is the amino acid sequence being compared to an
amino acid sequence present in SEQ ID NO:2 or 4. A candidate amino acid sequence
can be isolated from a natural source, or can be produced using recombinant
techniques, or chemically or enzymatically synthesized. Preferably, two amino acid
sequences are compared using the Blastp program of the BLAST 2 search algorithm,
as described by Tatusova et al. (FEMS Microbiol. Lett., 174:247-250, 1999, and
available on the world wide web at ncbi.nlm.nih.gov/gorf/bl2.html). Preferably, the
default values for all BLAST 2 search parameters are used, including matrix =
BLOSUM62; open gap penalty = 11, extension gap penalty = 1, gap x_dropoff = 50,
expect = 10, wordsize = 3, and optionally, filter on. In the comparison of two amino
acid sequences using the BLAST 2 search algorithm, amino acid identity is referred to
as "identities." A polypeptide of the present invention that has at least about 80%
identity with SEQ ID NO:2 or 4 also has the biological activity of OPAL1/GO.
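
To make the identity calculation concrete, the sketch below computes percent identity from two sequences that have already been aligned (for example, by BLAST 2), with gaps written as "-". It does not perform the alignment itself. The function name and the choice of the full alignment length as the denominator are assumptions made for this example.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    The inputs are the two rows of a pairwise alignment and must have
    equal length. Columns containing a gap never count as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != gap
    )
    return 100.0 * identical / len(aligned_a)
```

With the toy alignment "ACD-EF" against "ACDQEF", five of six columns are identical, giving about 83% identity, which would satisfy the 80% criterion but not the 90% criterion.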

The polypeptides of this aspect of the invention also include an active analog
of SEQ ID NO:2 or 4. Active analogs of SEQ ID NO:2 or 4 include polypeptides
having amino acid substitutions that do not eliminate the ability to perform the same
biological function(s) as OPAL1/GO. Substitutes for an amino acid may be selected
from other members of the class to which the amino acid belongs. For example,
nonpolar (hydrophobic) amino acids include alanine, leucine, isoleucine, valine,
proline, phenylalanine, tryptophan, and tyrosine. Polar neutral amino acids include
glycine, serine, threonine, cysteine, tyrosine, asparagine, and glutamine. The positively
charged (basic) amino acids include arginine, lysine, and histidine. The negatively
charged (acidic) amino acids include aspartic acid and glutamic acid. Such
substitutions are known to the art as conservative substitutions. Specific examples of
conservative substitutions include Lys for Arg and vice versa to maintain a positive
charge; Glu for Asp and vice versa to maintain a negative charge; Ser for Thr so that a
free -OH is maintained; and Gln for Asn to maintain a free NH2.

Active analogs, as that term is used herein, include modified polypeptides.
Modifications of polypeptides of the invention include chemical and/or enzymatic
derivatizations at one or more constituent amino acids, including side chain
modifications, backbone modifications, and N- and C-terminal modifications
including acetylation, hydroxylation, methylation, amidation, and the attachment of
carbohydrate or lipid moieties, cofactors, and the like.

The present invention further includes polynucleotides encoding the amino
acid sequence of SEQ ID NO:2 or 4. An example of the class of nucleotide sequences
encoding the polypeptide having SEQ ID NO:2 is SEQ ID NO:1; and an example of
the class of nucleotide sequences encoding the polypeptide having SEQ ID NO:4 is
SEQ ID NO:3. The other nucleotide sequences encoding the polypeptides having
SEQ ID NO:2 or 4 can be easily determined by taking advantage of the degeneracy of
the three letter codons used to specify a particular amino acid. The degeneracy of the
genetic code is well known to the art and is therefore considered to be part of this
disclosure. The classes of nucleotide sequences that encode SEQ ID NO:2 and 4 are
large but finite, and the nucleotide sequence of each member of the classes can be
readily determined by one skilled in the art by reference to the standard genetic code.
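
The statement that each class is large but finite can be illustrated by counting: under the standard genetic code, the number of distinct coding sequences for a polypeptide is the product of the number of codons available for each residue. The sketch below uses the standard codon counts per amino acid and ignores the stop codon; the function name is an assumption made for this example.

```python
# Number of codons per amino acid in the standard genetic code
# (one-letter codes; the 20 values sum to the 61 sense codons).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_coding_sequences(peptide: str) -> int:
    """Number of distinct nucleotide sequences encoding the given peptide."""
    total = 1
    for residue in peptide.upper():
        if residue not in CODON_COUNT:
            raise ValueError(f"unknown amino acid: {residue!r}")
        total *= CODON_COUNT[residue]
    return total
```

Even the three-residue peptide Leu-Ser-Arg has 6 x 6 x 6 = 216 possible coding sequences, so the class for a full-length polypeptide is very large while remaining enumerable in principle.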

The present invention also includes polynucleotides with a nucleotide
sequence having at least about 90% nucleotide identity, at least about 95% nucleotide
identity, or about 98% nucleotide identity with SEQ ID NO:1 or 3. Nucleotide
identity is defined in the context of a comparison between a nucleotide sequence and
SEQ ID NO:1 or 3, and is determined by aligning the residues of the two nucleotide
sequences (i.e., a candidate nucleotide sequence and the nucleotide sequence of SEQ
ID NO:1 or 3) to optimize the number of identical nucleotides along the lengths of
their sequences; gaps in either or both sequences are permitted in making the
alignment in order to optimize the number of identical nucleotides, although the
nucleotides in each sequence must nonetheless remain in their proper order. A
candidate nucleotide sequence is the nucleotide sequence being compared to a
nucleotide sequence present in SEQ ID NO:1 or 3. A candidate nucleotide sequence
can be isolated from a natural source, or can be produced using recombinant
techniques, or chemically or enzymatically synthesized. Percent identity is
determined by aligning two polynucleotides to optimize the number of identical
nucleotides along the lengths of their sequences; gaps in either or both sequences are
permitted in making the alignment in order to optimize the number of shared
nucleotides, although the nucleotides in each sequence must nonetheless remain in
their proper order. For example, the two nucleotide sequences are readily compared
using the Blastn program of the BLAST 2 search algorithm, as described by Tatusova
et al. (FEMS Microbiol. Lett., 174:247-250, 1999). Preferably, the default values for
all BLAST 2 search parameters are used, including reward for match = 1, penalty for
mismatch = -2, open gap penalty = 5, extension gap penalty = 2, gap x_dropoff = 50,
expect = 10, wordsize = 11, and filter on.

Examples of polynucleotides encoding a polypeptide of the present invention
also include those having a complement that hybridizes to the nucleotide sequence
SEQ ID NO:1 or 3 under defined conditions. The term "complement" refers to the
ability of two single stranded polynucleotides to base pair with each other, where an
adenine on one polynucleotide will base pair to a thymine on a second polynucleotide
and a cytosine on one polynucleotide will base pair to a guanine on a second
polynucleotide. Two polynucleotides are complementary to each other when a
nucleotide sequence in one polynucleotide can base pair with a nucleotide sequence in
a second polynucleotide. For instance, 5'-ATGC and 5'-GCAT are complementary.
As used herein, "hybridizes," "hybridizing," and "hybridization" mean that a single
stranded polynucleotide forms a noncovalent interaction with a complementary
polynucleotide under certain conditions. Typically, one of the polynucleotides is
immobilized on a membrane. Hybridization is carried out under conditions of
stringency that regulate the degree of similarity required for a detectable probe to bind
its target nucleic acid sequence. Preferably, at least about 20 nucleotides of the
complement hybridize with SEQ ID NO:1 or 3, more preferably at least about 50
nucleotides, most preferably at least about 100 nucleotides.
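
The base pairing rule stated above (A with T, C with G, strands read in opposite 5' to 3' directions) can be sketched in a few lines. The function names are assumptions made for this example, and the check covers only perfect, full-length complementarity of DNA sequences; it does not model hybridization stringency.

```python
# Watson-Crick pairing for DNA: A<->T and C<->G.
_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    return seq.translate(_PAIRING)[::-1]


def are_complementary(strand_a: str, strand_b: str) -> bool:
    """True if the two 5'-to-3' strands base pair along their full length."""
    return reverse_complement(strand_a) == strand_b.upper()
```

Applied to the example in the text, the complement of 5'-ATGC, written 5' to 3', is GCAT, so the two sequences are reported as complementary.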

Also provided by the invention is an OPAL1/GO antibody, or antigen-binding
portion thereof, that binds the novel protein OPAL1/GO. OPAL1/GO antibodies can
be used to detect OPAL1/GO protein; they are also useful therapeutically to modulate
expression of the OPAL1/GO gene. An antibody may be polyclonal or monoclonal.
Methods for making polyclonal and monoclonal antibodies are well known to the art.
Monoclonal antibodies can be prepared, for example, using hybridoma,
recombinant, and phage display technologies, or a combination thereof. See Golub et
al., U.S. Patent Application Publication No. 2003/0134300, published July 17, 2003,
for a detailed description of the preparation and use of antibodies as diagnostics and
therapeutics.

Preferably the antibody is a human or humanized antibody, especially if it is to
be used for therapeutic purposes. A human antibody is an antibody having the amino
acid sequence of a human immunoglobulin and includes antibodies produced by
human B cells, or isolated from human sera, human immunoglobulin libraries or from
animals transgenic for one or more human immunoglobulins and that do not express
endogenous immunoglobulins, as described in U.S. Pat. No. 5,939,598 by
Kucherlapati et al., for example. Transgenic animals (e.g., mice) that are capable,
upon immunization, of producing a full repertoire of human antibodies in the absence
of endogenous immunoglobulin production can be employed. For example, it has
been described that the homozygous deletion of the antibody heavy chain joining
region (J(H)) gene in chimeric and germ-line mutant mice results in complete
inhibition of endogenous antibody production. Transfer of the human germ-line
immunoglobulin gene array in such germ-line mutant mice will result in the
production of human antibodies upon antigen challenge (see, e.g., Jakobovits et al.,
Proc. Natl. Acad. Sci. U.S.A., 90:2551-2555 (1993); Jakobovits et al., Nature,
362:255-258 (1993); Bruggemann et al., Year in Immuno., 7:33 (1993)). Human
antibodies can also be produced in phage display libraries (Hoogenboom et al., J.
Mol. Biol., 227:381 (1991); Marks et al., J. Mol. Biol., 222:581 (1991)). The
techniques of Cole et al. and Boerner et al. are also available for the preparation of
human monoclonal antibodies (Cole et al., Monoclonal Antibodies and Cancer
Therapy, Alan R. Liss, p. 77 (1985); Boerner et al., J. Immunol., 147(1):86-95
(1991)).

Antibodies generated in non-human species can be "humanized" for
administration in humans in order to reduce their antigenicity. Humanized forms of
non-human (e.g., murine) antibodies are chimeric immunoglobulins, immunoglobulin
chains or fragments thereof (such as Fv, Fab, Fab', F(ab')2, or other antigen-binding
subsequences of antibodies) which contain minimal sequence derived from
non-human immunoglobulin. Residues from a complementarity determining region (CDR)
of a human recipient antibody are replaced by residues from a CDR of a non-human
species (donor antibody) such as mouse, rat or rabbit having the desired specificity.
Optionally, Fv framework residues of the human immunoglobulin are replaced by
corresponding non-human residues. See Jones et al., Nature, 321:522-525 (1986);
Riechmann et al., Nature, 332:323-327 (1988); and Presta, Curr. Op. Struct. Biol.,
2:593-596 (1992). Methods for humanizing non-human antibodies are well known in
the art. See Jones et al., Nature, 321:522-525 (1986); Riechmann et al., Nature,
332:323-327 (1988); Verhoeyen et al., Science, 239:1534-1536 (1988); and U.S. Pat.
No. 4,816,567.

Laboratory applications 

The present invention further includes a microchip for use in clinical settings
for detecting gene expression levels of one or more genes described herein as being
associated with outcome, risk classification, cytogenetics or subtype in infant leukemia
and pediatric ALL. In a preferred embodiment, the microchip contains DNA probes
specific for the target gene(s). Also provided by the invention is a kit that includes
means for measuring expression levels for the polypeptide product(s) of one or more
such genes, preferably OPAL1/G0, G1, G2, FYN binding protein, PBK1, or any of the
genes listed in Table 42. In a preferred embodiment, the kit is an immunoreagent kit
and contains one or more antibodies specific for the polypeptide(s) of interest.




EXAMPLES

The present invention is illustrated by the following examples. It is to be
understood that the particular examples, materials, amounts, and procedures are to be
interpreted broadly in accordance with the scope and spirit of the invention as set
forth herein.

EXAMPLE IA.
Laboratory Methods and Cohort Design 



Leukemia Blast Purification, RNA Isolation, Amplification and Hybridization to 
Oligonucleotide Arrays 

Laboratory techniques were developed to optimize sample handling and
processing for high quality microarray studies for gene expression profiling in
leukemia samples. Reproducible methods were developed for leukemia blast
purification, RNA isolation, linear amplification, and hybridization to oligonucleotide
arrays. Our optimized approach is a modification of a double amplification method
originally developed by Ihor Lemischka and colleagues from Princeton University
(Ivanova et al., Science 298(5593):601-604 (2002)).

Total RNA was isolated from leukemic blasts using Qiagen RNeasy. An
average of 2 x 10^ cells were used for total RNA extraction with the Qiagen RNeasy
mini kit (Valencia, CA). The yield and integrity of the purified total RNA were
assessed with the RiboGreen assay (Molecular Probes, Eugene, OR) and the RNA
6000 Nano Chip (Agilent Technologies, Palo Alto, CA), respectively.

Complementary RNA (cRNA) target was prepared from 2.5 µg total RNA
using two rounds of Reverse Transcription (RT) and In Vitro Transcription (IVT).
Following denaturation for 5 minutes at 70°C, the total RNA was mixed with 100
pmol T7-(dT)24 oligonucleotide primer (Genset Oligos, La Jolla, CA) and allowed
to anneal at 42°C. The mRNA was reverse transcribed with 200 units Superscript II
(Invitrogen, Grand Island, NY) for 1 hour at 42°C. After RT, 0.2 volume 5X second
strand buffer, additional dNTP, 40 units DNA polymerase I, 10 units DNA ligase, and 2
units RNase H (Invitrogen) were added and second strand cDNA synthesis was
performed for 2 hours at 16°C. After addition of T4 DNA polymerase (10 units), the
mix was incubated an additional 10 minutes at 16°C. An equal volume of
phenol:chloroform:isoamyl alcohol (25:24:1) (Sigma, St. Louis, MO) was used for
enzyme removal. The aqueous phase was transferred to a microconcentrator
(Microcon 50, Millipore, Bedford, MA) and washed/concentrated with 0.5 ml DEPC
water twice until the sample was concentrated to 10-20 µl. The cDNA was then
transcribed with T7 RNA polymerase (Megascript, Ambion, Austin, TX) for 4 hr at
37°C. Following IVT, the sample was phenol:chloroform:isoamyl alcohol extracted,
washed and concentrated to 10-20 µl.

The first round product was used for a second round of amplification which
utilized random hexamer and T7-(dT)24 oligonucleotide primers, Superscript II, two
RNase H additions, DNA polymerase I plus T4 DNA polymerase, and finally a
biotin-labeling high yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, NY).
The biotin-labeled cRNA was purified on Qiagen RNeasy mini kit columns, eluted
with 50 µl of 45°C RNase-free water and quantified using the RiboGreen assay.

Following RNA isolation and cRNA amplification using two rounds of poly
dT primer-anchored Reverse Transcription and T7 RNA polymerase transcription,
RNA and cRNA quality was assessed by capillary electrophoresis on Agilent RNA
Lab-Chips. After the quality check on Agilent Nano 900 Chips, 15 µg cRNA were
fragmented following the Affymetrix protocol (Affymetrix, Santa Clara, CA). The
fragmented RNA was then hybridized for 20 hours at 45°C to HG_U95Av2 probe arrays.
The hybridized probe arrays were washed and stained with the EukGE_WS2 fluidics
protocol (Affymetrix), including streptavidin phycoerythrin conjugate (SAPE,
Molecular Probes, Eugene, OR) and an antibody amplification step (Anti-streptavidin,
biotinylated, Vector Labs, Burlingame, CA). HG_U95Av2 chips were scanned at 488
nm, as recommended by Affymetrix. The expression value of each gene was
calculated using Affymetrix Microarray Suite 5.0 software.

We routinely obtain 100-200 micrograms of amplified cRNA from 2.5
micrograms of leukemia cell-derived total RNA. Our detailed statistical analyses
comparing various RNA inputs and single vs. double amplification methods have
shown that this approach leads to an excellent representation of low as well as high
abundance mRNAs and is highly reproducible. It has the added benefit of not losing
the representation of low abundance genes frequently lost in methods that lack
amplification or only perform single round amplifications. As only 15 micrograms of
cRNA are required per Affymetrix chip, we are able to store residual cRNA in
virtually all cases; this highly valuable cRNA can be used again in the future as array
platforms and methods of analysis improve. Samples were studied using
oligonucleotide microarrays containing 12,625 probes (Affymetrix U95Av2 array
platform).

Statistical design 

We designed two retrospective cohorts of pediatric ALL patients registered to
clinical trials previously coordinated by the Pediatric Oncology Group (POG): 1) a
cohort of 127 infant leukemias (the "infant" data set); and 2) a case control study of 254
pediatric B-precursor and T cell ALL cases (the "preB" data set). These samples were
obtained from patients with long term follow up who were registered to clinical trials
completed by the POG. In the analysis of gene
expression profiles for classification and particularly outcome prediction, it is
essential to integrate gene expression data with laboratory parameters that impact the
quality of the primary data, and to make sure that any derived cluster or gene list
cannot be accounted for by variations in laboratory methodology. Thus we tracked
and annotated our gene expression data set with all of the laboratory correlates shown
below.

Laboratory Correlates
Vial Date = Sample Collection Date Value
Percent Leukemic Blasts in Sample = Integer
Sample Viability = Integer
RNA Method = Boolean
RNA Quality = Boolean
RNA Starting Amount = Amount Amplified (Floating Point)
Experimental Set = 16/ Arrays per Set (Integer)
Amplification Date = Date Value (Linked to Reagent Lot)
aRNA Quality = Quality of Amplified RNA

Clinical, demographic, and outcome data are also essential for predictive profiling. 

Clinical/Patient Sample Correlates
COG_NO = Patient Identifier (Integer)
Study_NO = Treatment Study (Integer)
AGE_DAYS = Age at Initial Registration (Integer)
RAC = Patient Race (String)
SX = Patient Sex (String)
WBC_BLD = Presenting Blood Count (Floating Point)
DUR_CR = Duration of Complete Remission (Days)
REMISS = (CCR=Continuous Complete Remission; FAIL=Failed Therapy; String but representing a Boolean)
ACH-CR = Achieved Initial CR (String, but Boolean)
DI = DNA Index (Leukemia Cell DNA Amount, Floating)
KARYOTYP = Cytogenetic Abnormality

Blinded cohort studies were developed for the conduct of the array experiments. In 
this way, the individuals performing arrays were blinded to all clinical and outcome 
correlative variables. 


For the retrospective "infant" study, 142 cases from two POG
infant trials (9407 for infant ALL; 9421 for infant AML) were initially chosen for
analysis. Infants as defined were <365 days in age and had overall extremely poor
survival rates (<25%). Of the 142 cases, 127 were ultimately retained in the study; 15
cases were excluded from the final analysis due to poor quality total RNA, cRNA
amplification, or hybridization. Of the final 127 cases analyzed, 79 were considered
traditional ALL by morphology and immunophenotyping and 48 were considered
AML. Of these cases, 59/127 had rearrangements of the MLL gene.

The 254 member retrospective pre-B and T cell ALL case control study (the
"preB" study) was selected from a number of pediatric POG clinical trials. A cohort
design was developed that could compare and contrast gene expression profiles in
distinct cytogenetic subgroups of ALL patients who either did or did not achieve a
long term remission (for example comparing children with t(4;11) who failed vs.
those who achieved long term remission). Such a design allowed us to compare and
contrast the gene expression profiles associated with different outcomes within each
genetic group and to compare profiles between different cytogenetic abnormalities.
The design was constructed to look at a number of small independent case-control
studies within B precursor ALL and T cell ALL. For the B cell ALL group, the
representative recurrent translocations included t(4;11), t(9;22), t(1;19), monosomy 7,
monosomy 21, Females, Males, African American, Hispanic, and ALinC 15 arm A.
Cases were selected from several completed POG trials, but the majority of cases
came from the POG 9000 series, including 8602, 9406, 9005, and 9006 as long term
follow up was available.

As standard cytogenetic analysis of the samples from patients registered to
these older trials would not have usually detected the t(12;21), we performed RT-PCR
studies on a large cohort of these cases to select ALL cases with t(12;21) who either
failed therapy (n=8) or achieved long term remissions (n=22). Cases who "failed" had
failed within 4 years while "controls" had achieved a complete continuous remission
of 4 or more years. A case-control study of induction failures (cases) vs. complete
remissions (CRs; controls) was also included in this cohort design as was a T cell
cohort.

It is very important to recognize that the study was designed for efficiency,
and maximum overlap, without adversely affecting the random sampling assumptions
for the individual case-control studies. To design this cohort, the set of all patients
(irrespective of study) who had inventory in the UNM POG/COG Tissue Repository
and who had failed within 4 years of diagnosis (cases) were considered. Each such
case was assigned a random number from zero to one. Cases were then sorted by this
random number. The same process was applied to the totality of potential controls.
For each case-control study, we then took the first N patients (requested in design) or
all patients (whichever was smaller), meeting the entry requirements for the particular
study. By maximizing the overlap in this fashion, a savings of over 20% compared to
a design that required mutually exclusive entries was achieved. Yet for any given
case-control study, the patients represent pure random samples of cases and controls.
(For example if the first patient in the sort of the failure group were an
African-American female with a t(1;19) translocation, she would participate in at least three
case control studies). As for the infant leukemia cases, gene expression arrays were
completed using 2.5 micrograms of RNA per case (all samples had >90% blasts) with
double linear amplification. All amplified RNAs were hybridized to Affymetrix
U95Av2 chips.
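
The overlapping selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study: the patient records, the field names (`id`, `sex`, `race`, `karyotype`), the sub-study names, and the requested sizes are all hypothetical. Each patient in a pool (cases or controls) receives one random number, the pool is sorted once by that number, and every sub-study takes the first N eligible patients from the same sorted list, so one patient can serve several sub-studies.

```python
import random

def select_overlapping(pool, studies, seed=0):
    """Fill every sub-study from one shared random ordering of the pool.

    pool    -- list of patient dicts, each with a unique "id" (hypothetical schema)
    studies -- {name: (eligibility_function, requested_n)}
    Returns {name: [patient ids]}; each list holds the first requested_n eligible
    patients in the shared random order, or all eligible patients if fewer exist.
    """
    rng = random.Random(seed)
    # One random number in [0, 1) per patient, assigned once for all sub-studies.
    key = {p["id"]: rng.random() for p in pool}
    ranked = sorted(pool, key=lambda p: key[p["id"]])
    selection = {}
    for name, (eligible, requested_n) in studies.items():
        matches = [p["id"] for p in ranked if eligible(p)]
        selection[name] = matches[:requested_n]
    return selection

# Hypothetical pool of failures ("cases"); controls would be handled the same way.
cases = [
    {"id": i,
     "sex": "F" if i % 2 else "M",
     "race": "African American" if i % 4 == 1 else "Other",
     "karyotype": "t(1;19)" if i % 3 == 0 else "other"}
    for i in range(60)
]
studies = {
    "female": (lambda p: p["sex"] == "F", 10),
    "african_american": (lambda p: p["race"] == "African American", 10),
    "t_1_19": (lambda p: p["karyotype"] == "t(1;19)", 10),
}
chosen = select_overlapping(cases, studies, seed=42)
distinct = set().union(*chosen.values())
print(len(distinct), "distinct patients fill", sum(map(len, chosen.values())), "slots")
```

Because every sub-study reads from the same random ordering, each selection remains a simple random sample of its eligible patients, while the number of distinct patients that must be arrayed is at most, and usually less than, the sum of the requested sizes.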

EXAMPLE IB.
Computational Methods

The present invention makes use of a suite of high-end analytic tools for the
analysis of gene expression data. Many of these represent novel implementations or
significant extensions of advanced techniques from statistical and machine learning
theory, or new data mining approaches for dealing with high-dimensional and
sparse datasets. The approaches can be categorized into two major groups:
knowledge discovery environments, and supervised classification methodologies.


Clustering, Visualization, and Text-Mining 
1. VxInsight

VxInsight is a data mining tool (Davidson et al., J. Intellig. Inform. Sys.
11:259-285, 1998; Davidson et al., IEEE Information Visualization 2001, 23-30,
2001) originally developed to cluster and organize bibliographic databases, which has
been extended and customized for the clustering and visualization of genomic data. It
presents an intuitive way to cluster and view gene expression data collected from
microarray experiments (Kim et al., Science 293:2087-92, 2001). It can be applied
equally to the clustering of genes (e.g., in a time-series experiment) or to the discovery of
novel biologic clusters within a cohort of leukemia patient samples. Similar genes or
patients are clustered together spatially and represented with a 3D terrain map, where
the large mountains represent large clusters of similar genes/samples and smaller hills
represent clusters with fewer genes/samples. The terrain metaphor is extremely
intuitive, and allows the user to memorize the "landscape," facilitating navigation
through large datasets.

VxInsight's clustering engine, or ordination program, is based on a force-directed
graph placement algorithm that utilizes all of the similarities between objects in the
dataset. When applied to gene clustering, for example, the algorithm assigns genes
into clusters such that the sum of two opposing forces is minimized. One of these
forces is repulsive and pushes pairs of genes away from each other as a function of
the density of genes in the local area. The other force pulls pairs of similar genes
together based on their degree of similarity. The clustering algorithm terminates
when these forces are in equilibrium. User-selected parameters determine the
fineness of the clustering, and there is a tradeoff with respect to confidence in the
reliability of the cluster versus further refinement into sub-clusters that may suggest
biologically important hypotheses.

VxInsight was employed to identify clusters of infant leukemia patients with
similar gene expression patterns, and to identify which genes strongly contributed to
the separations. A suite of statistical analysis tools was developed for
post-processing information gleaned from the VxInsight discovery process. Visual and
clustering analyses generated gene lists, which when combined with public
databases and research experience, suggest possible biological significance for those
clusters. The array expression data were clustered by rows (similar genes clustered
together), and by columns (patients with similar gene expression clustered together).
In both cases Pearson's R was used to estimate the similarities. Analysis of variance
(ANOVA) was used to determine which genes had the strongest differences
between pairs of patient clusters. These gene lists were sorted into decreasing order
based on the resulting F-scores, and were presented in an HTML format with links
to the associated OMIM pages (Online Mendelian Inheritance in Man database,
available on the world wide web through the National Center for Biotechnology
Information), which were manually examined to hypothesize biological differences
between the clusters. Gene list stability was investigated using statistical bootstraps
(Efron, Ann. Statist. 7:1-26, 1979; Hjorth et al., Computer Intensive Statistical
Methods, Validation Model Selection and Bootstrap. Chapman & Hall, London,
1994). For each pair of clusters 100 random bootstrap cases were constructed via
resampling with replacement from the observed expressions (Fig. 3). Next, the
resulting ordered lists of genes were determined, using the same ANOVA method
as before. The average order in the set of bootstrapped gene lists was computed for
all genes, and reported as an indication of rank order stability (the percentile from
the bootstraps estimates a p-value for observing a gene at or above the list order
observed using the original experimental values).
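
The F-score ranking and bootstrap stability check described above can be sketched as follows. This is a minimal illustration rather than the analysis suite itself: the gene names and expression values are invented, patients are resampled with replacement within each cluster (an assumption about how the bootstrap cases were drawn), and only the average bootstrap rank is reported.

```python
import random
from statistics import mean

def f_score(a, b):
    """One-way ANOVA F statistic for one gene measured in two patient clusters."""
    grand = mean(a + b)
    mean_a, mean_b = mean(a), mean(b)
    ss_between = len(a) * (mean_a - grand) ** 2 + len(b) * (mean_b - grand) ** 2
    ss_within = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    if ss_within == 0:
        # Degenerate resample: no within-cluster spread at all.
        return float("inf") if ss_between > 0 else 0.0
    df_within = len(a) + len(b) - 2
    return ss_between / (ss_within / df_within)  # between-groups df is 1

def rank_genes(cluster_a, cluster_b):
    """Return {gene: rank}, with rank 1 for the largest F score."""
    scores = {g: f_score(cluster_a[g], cluster_b[g]) for g in cluster_a}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {g: r for r, g in enumerate(ordered, start=1)}

def bootstrap_mean_rank(cluster_a, cluster_b, n_boot=100, seed=0):
    """Average rank of each gene over n_boot resamples of the patients."""
    rng = random.Random(seed)
    n_a = len(next(iter(cluster_a.values())))
    n_b = len(next(iter(cluster_b.values())))
    totals = dict.fromkeys(cluster_a, 0)
    for _ in range(n_boot):
        # Resample patient indices with replacement, separately per cluster.
        idx_a = [rng.randrange(n_a) for _ in range(n_a)]
        idx_b = [rng.randrange(n_b) for _ in range(n_b)]
        boot_a = {g: [v[i] for i in idx_a] for g, v in cluster_a.items()}
        boot_b = {g: [v[i] for i in idx_b] for g, v in cluster_b.items()}
        for g, r in rank_genes(boot_a, boot_b).items():
            totals[g] += r
    return {g: t / n_boot for g, t in totals.items()}

# Hypothetical expression values: gene -> one value per patient in the cluster.
cluster_a = {"gene1": [10.0, 11.0, 9.0, 10.5], "gene2": [5.0, 6.0, 4.0, 5.5]}
cluster_b = {"gene1": [1.0, 2.0, 1.5, 0.5], "gene2": [5.2, 4.8, 6.1, 4.4]}
print(rank_genes(cluster_a, cluster_b))
print(bootstrap_mean_rank(cluster_a, cluster_b))
```

A gene whose average bootstrap rank stays close to its original rank is a stable member of the list; a gene whose rank drifts widely across resamples is less trustworthy.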

2. Principal Component Analysis

Principal component analysis (PCA) is a well-known and convenient method
for performing unsupervised clustering of high-dimensional data. Closely related to
the Singular Value Decomposition (SVD), PCA is an unsupervised data analysis
technique whereby the most variance is captured in the least number of coordinates.
It can serve to reduce the dimensionality of the data while also providing significant
noise reduction. It is a standard technique in data analysis and has been widely
applied to microarray data. Recently (Raychaudhuri et al., Pac. Symp. Biocomput.,
5:455-466, 2002) PCA was used to analyze cell cycles in yeast (Chu et al., Science,
282:699-705, 1998; Spellman et al., Mol. Biol. Cell, 9:3273-97, 1998); PCA has
also been applied to clustering (Hastie et al., Genome Biology 1:research0003,
2000; Holter et al., Proc. Natl. Acad. Sci., 97:8409-14, 2000); other applications of
PCA to microarray data have been suggested (Wall et al., Bioinformatics 17:566-568,
2001).

PCA works by providing a statistically significant projection of a dataset onto 
an orthonormal basis. This basis is computed so that a variety of quantities are 
optimized. In particular we have (Kirby, Geometric Data Analysis, John Wiley & 
Sons, New York, 2001): 

• maximization of the statistical variance,
• minimization of mean square truncation error,
• maximization of the mean squared projection,
• minimization of entropy.

Furthermore, the PCA basis optimizes these quantities by dimension. In other 
words, the first PCA basis vector provides the best one-dimensional projection of 
the data subject to the above conditions, the first and second PCA basis vectors 
provide the best two-dimensional projection, et cetera. The PCA basis is typically 
computed by solving an eigenvalue problem closely related to the SVD (Kirby, 
Geometric Data Analysis, John Wiley & Sons, New York, 2001; Trefethen et al.,
Numerical Linear Algebra, SIAM, Philadelphia, 1997). Consequently, the PCA
basis vectors are often called eigenvectors; in the context of microarray data they 
are occasionally called eigen-genes, eigen-arrays, or eigen-patients. PCA is 
typically illustrated by finding the major and minor axes in a cloud of data filling an 
ellipse. The first eigenvector corresponds to the major axis of the ellipse while the 
second eigenvector corresponds to the minor axis. PCA is used to analyze the 
principal sources of error in microarray experiments, and to perform variance 
analysis of VxInsight-derived clusters. 
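
The ellipse illustration can be reproduced with a short, dependency-free sketch. It finds only the first principal component, and it uses power iteration on the sample covariance matrix instead of a full eigenvalue or SVD routine; that substitution, the function names, and the synthetic data are assumptions made for illustration.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of the columns of `rows` (one row per observation)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Leading eigenvalue and eigenvector of the covariance matrix (power iteration)."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return variance, v

# Synthetic cloud: points on an ellipse with semi-axes 3 and 1, rotated by 30 degrees.
theta = math.radians(30)
points = []
for k in range(40):
    t = 2 * math.pi * k / 40
    u, s = 3 * math.cos(t), math.sin(t)
    points.append([u * math.cos(theta) - s * math.sin(theta),
                   u * math.sin(theta) + s * math.cos(theta)])

variance, axis = first_principal_component(points)
angle = math.degrees(math.atan2(axis[1], axis[0]))
print(f"first principal axis at about {angle:.1f} degrees, variance {variance:.3f}")
```

The recovered direction lines up with the major axis of the ellipse; projecting the points onto it keeps the largest share of the variance that any single coordinate can capture.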

Supervised learning methods and feature selection for class prediction

1. Bayesian Networks

The Bayesian network modeling and learning paradigm (Pearl, Probabilistic
Reasoning for Intelligent Systems. Morgan Kaufmann, San Francisco, 1988;
Heckerman et al., Machine Learning 20:197-243, 1995) has been studied
extensively in the statistical machine learning literature. A Bayesian net is a
graph-based model for representing probabilistic relationships between random variables.
The random variables, which may, for example, represent gene expression levels,
are modeled as graph nodes; probabilistic relationships are captured by directed
edges between the nodes and conditional probability distributions associated with
the nodes. In the context of genomic analysis, this framework is particularly
attractive because it allows hypotheses of actor interactions (e.g., gene-gene,
gene-protein, gene-polymorphism) to be generated and evaluated in a mathematically
sound manner against existing evidence. Network reconstruction, pathway
identification, diagnosis, and outcome prediction are among the many challenges of
current interest that Bayesian networks can address. Introduction of new network
nodes (random variables) can model effects of previously hidden state variables,
conditioning prediction on such factors as subject characteristics, disease subtype,
polymorphic information, and treatment variables.

A Bayesian net asserts that each node (representing a gene or an outcome) is
statistically independent of all its non-descendants, once the values of its parents
(immediate ancestors) in the graph are known. Even with the focus on restricted
subnetworks, the learning problem is enormously difficult, due to the large number
of genes, the fact that the expression values of the genes are continuous, and the fact
that expression data generally is rather noisy. Our approach to Bayesian network
learning employs an initial gene selection algorithm to produce 20-30 genes, with a
binary binning of each selected gene's expression value. The set of selected genes
then is searched exhaustively for parent sets of size 5 or less, with the induced
candidate networks being evaluated by the BD scoring metric (Heckerman et al.,
Machine Learning 20:197-243, 1995). This metric, along with our variance factor,
is used to blend the predictions made by the 500 best scoring networks. Each of
these 500 Bayesian networks can be viewed as a competing hypothesis for
explaining the current evidence (i.e., training data and prior knowledge) for the
corresponding classification task, and the gene interactions each suggests are
potentially of independent interest as well.
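
The exhaustive parent-set search can be illustrated with a small sketch. It makes several simplifying assumptions that are not taken from the text: the class label is the only child node, genes are already binned to 0/1, all Dirichlet hyperparameters are set to 1 (the K2 special case of the BD metric), the variance factor is omitted, and the toy data set is invented. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import lgamma

def log_bd_score(rows, labels, parents):
    """Log BD score (K2 prior) of a binary label given a tuple of parent genes."""
    counts = Counter()
    for row, y in zip(rows, labels):
        counts[(tuple(row[g] for g in parents), y)] += 1
    configs = {cfg for cfg, _ in counts}  # observed parent configurations
    score = 0.0
    for cfg in configs:
        n_jk = [counts[(cfg, y)] for y in (0, 1)]
        # Gamma(a_j) / Gamma(a_j + N_j), with a_j = 2 (two label states, a_jk = 1 each)
        score += lgamma(2) - lgamma(2 + sum(n_jk))
        # Product over label states of Gamma(1 + N_jk) / Gamma(1)
        score += sum(lgamma(1 + n) for n in n_jk)
    return score

def rank_parent_sets(rows, labels, max_size=2):
    """Score every parent set up to max_size genes; best (highest) score first."""
    n_genes = len(rows[0])
    scored = [(log_bd_score(rows, labels, ps), ps)
              for size in range(1, max_size + 1)
              for ps in combinations(range(n_genes), size)]
    return sorted(scored, reverse=True)

# Toy data: gene 0 tracks the outcome exactly; genes 1 and 2 are unrelated to it.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
        [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
ranked = rank_parent_sets(rows, labels)
for score, parents in ranked[:3]:
    print(parents, round(score, 3))
```

The score rewards parent sets that make the label predictable and penalizes extra parents that only split the data into smaller cells, which is why the single informative gene outranks any pair containing it in this toy example.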

Bayesian analysis allows the combining of disparate evidence in a principled
way. Abstractly, the analysis synthesizes known or believed prior domain
information with bodies of possibly diverse observational and experimental data
(e.g., microarrays giving gene expression levels, polymorphism information,
clinical data) to produce probabilistic hypotheses of interaction and prediction.
Prior elicitation and representation quantifies the strength of beliefs in domain
information, allowing this knowledge and observational and experimental data to be
handled in a uniform manner. Strong priors are akin to plentiful and reliable data;
weaker priors are akin to sparse, noisy data. Similarly, observational and
experimental data can be qualified by its reliability, accuracy, and variability, taking
into account the different sources that produced the data and inherent differences in
the natures of the data. Of course, observational and experimental data will
eventually dominate the analysis if it is of sufficient size and quality.

In the context of outcome and disease subtype prediction, we applied a highly
customized and extended Bayesian net methodology to high-dimensional sparse
data sets with feature interaction characteristics such as those found in the genomics
application. These customizations included the parent-set model for Bayesian net
classifiers, the blending of competing parent sets into a single classifier, the
pre-filtering of genes for information content, Helman-Veroff normalization to
pre-process the data, methods for discretizing continuous data, the inclusion of a
variance term in the BD metric, and the setting of priors. Our normalization
algorithm is designed to address inter-sample differences in gene expression levels
obtained from the microarray experiments. It proceeds by scaling each sample's
expression levels by a factor derived from the aggregate expression level of that
sample. In this way, after scaling, all samples have the same aggregate expression
level.
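
The scaling step can be sketched as below. This is an illustration of the general idea only and is not claimed to reproduce the Helman-Veroff procedure: the aggregate is assumed to be the plain sum of a sample's expression values, the common target defaults to the mean of those sums, and the function name and data are hypothetical.

```python
def normalize_by_aggregate(samples, target=None):
    """Rescale each sample so that all samples share one aggregate expression level.

    samples -- list of samples, each a list of expression values (one per gene)
    target  -- common aggregate after scaling; defaults to the mean of the raw sums
    """
    totals = [sum(sample) for sample in samples]
    if target is None:
        target = sum(totals) / len(totals)
    return [[value * target / total for value in sample]
            for sample, total in zip(samples, totals)]

# Hypothetical arrays: the second is the first hybridized twice as brightly.
arrays = [[120.0, 30.0, 50.0], [240.0, 60.0, 100.0], [60.0, 40.0, 20.0]]
scaled = normalize_by_aggregate(arrays)
print([round(sum(s), 6) for s in scaled])
```

After scaling, two arrays that differ only by overall brightness become identical, while genuine differences in the relative expression of individual genes are preserved.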

A set of training data, labeled with outcome or disease subtype, was used to
generate and evaluate hypotheses against the training data. A cross validation
methodology was employed to learn parameter settings appropriate for the domain.
Surviving hypotheses were blended in the Bayesian framework, yielding conditional
outcome distributions. Hypotheses so learned were validated against an
out-of-sample test set in order to assess generalization accuracy. This approach was
successfully used to identify OPAL1/G0 as a strong predictor of outcome in
pediatric ALL as described in Example II.

2. Support Vector Machines

Support vector machines (SVMs) are powerful tools for data classification
(Cristianini et al., An Introduction to Support Vector Machines and Other
Kernel-Based Learning Methods. Cambridge University Press, Cambridge, 2000; Vapnik,
Statistical Learning Theory, John Wiley & Sons, New York, 1999). The original
development of the SVM was motivated, in the simple case of two linearly
separable classes, by the desire to choose an optimal linear classifier out of an
infinite number of potential linear classifiers that could separate the data. This
optimal classifier corresponds not only to a hyperplane that separates the classes but
also to a hyperplane that attempts to be as far away as possible from all data points.
If one imagines inserting the widest possible corridor between data points (with data
points belonging to one class on one side of the corridor and data points belonging
to the other class on the other side), then the optimal hyperplane would correspond
to the imaginary line/plane/hyperplane running through the middle of this corridor.

The SVM has a number of characteristics that make it particularly appealing
within the context of gene selection and the classification of gene expression data,
namely: SVMs represent a multivariate classification algorithm that takes into
account each gene simultaneously in a weighted fashion during training, and they
scale quadratically with the number of training samples, N, rather than the number
of features/genes, d. In order to be computationally feasible, other classification
methods first have to reduce the number of dimensions (features/genes), and then
classify the data in the reduced space. A univariate feature selection process or
filter ranks genes according to how well each gene individually classifies the data.
The overall classification is then heavily dependent upon how successful the
univariate feature selection process is in pruning genes that have little
class-distinction information content. In contrast, the SVM provides an effective
mechanism for both classification and feature selection via the Recursive Feature
Elimination algorithm (Guyon et al., Machine Learning 46:389-422, 2002). This is
a great advantage in gene expression problems where d is much greater than N,
because the number of features does not have to be reduced a priori.

Recursive Feature Elimination (RFE) is an SVM-based iterative procedure
that generates a nested sequence of gene subsets whereby the subset obtained at
iteration k+1 is contained in the subset obtained at iteration k. The genes that are
kept per iteration correspond to genes that have the largest weight magnitudes, the
rationale being that genes with large weight magnitudes carry more information
with respect to class discrimination than those genes with small weight magnitudes.
We have implemented a version of SVM-RFE and obtained excellent results,
comparable to Bayesian nets, for a range of infant leukemia classification tasks
with blinded test sets.
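
A compact sketch of the RFE loop follows. It is not the implementation referred to above: the linear SVM is fitted here by plain subgradient descent on the soft-margin primal objective (an assumption chosen to keep the example dependency-free), one gene is removed per iteration, the hyperparameters are arbitrary, and the data are a deliberately simple toy in which only feature 0 carries class information.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Soft-margin linear SVM via subgradient descent on
    (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b))). Labels must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample lies inside the margin or is misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_rfe(X, y, n_keep):
    """Retrain on the surviving features and drop the smallest-|weight| one each round.

    Returns (kept feature indices, eliminated indices in order of removal)."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        eliminated.append(active.pop(weakest))
    return active, eliminated

# Toy data: feature 0 separates the classes; features 1 and 2 are unrelated to y.
y = [1, 1, 1, 1, -1, -1, -1, -1]
noise_1 = [1, -1, 1, -1, 1, -1, 1, -1]
noise_2 = [1, 1, -1, -1, 1, 1, -1, -1]
X = [[2.0 * yi, float(a), float(b)] for yi, a, b in zip(y, noise_1, noise_2)]

kept, eliminated = svm_rfe(X, y, n_keep=1)
print("kept:", kept, "eliminated (in order):", eliminated)
```

Each pass yields a subset contained in the previous one, so the order of elimination doubles as a ranking of the genes. Ranking by squared weight, as in Guyon et al., gives the same order as ranking by absolute weight.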


3. Discriminant Analysis 

Discriminant analysis is a widely used statistical analysis tool that can be
applied to classification problems where a training set of samples, depending on a set of
p feature variables, is available (Duda et al., Pattern Classification (Second Edition),
Wiley, New York, 2001). Each sample is regarded as a point in p-dimensional space
R^p, and for a g-way classification problem, the training process yields a discriminant
rule that partitions R^p into g disjoint regions, R_1, R_2, ..., R_g. New samples with
unknown class labels can then be classified based on the region R_i to which the
corresponding sample vector belongs. In many cases, determining the partitioning is
equivalent to finding several linear or non-linear functions of the feature variables
such that the value of each function differs significantly between different classes. These
functions are the so-called discriminant functions. Discriminant rules fall into two
categories: parametric and nonparametric. Parametric methods such as the maximum
likelihood rule, including the special cases of linear discriminant analysis (LDA) and
quadratic discriminant analysis (QDA) (Mardia et al., Multivariate Analysis,
Academic Press, Inc., San Diego, 1979; Dudoit et al., J. Am. Stat. Ass'n. 97(457):77-87,
2002), assume that there is an underlying probability distribution associated with
each of the classes, and the training samples are used to estimate the distribution
parameters. Non-parametric methods such as Fisher's linear discriminant and the
k-nearest neighbor method (Duda et al., Pattern Classification (Second Edition), Wiley,
New York, 2001) do not utilize parameter estimation of an underlying distribution in
order to perform classifications based on a training set.

In applying discriminant analysis techniques to the gene expression
classification problem, both categories of methods have been utilized, specifically
LDA (binary classification) and Fisher's linear discriminant (multi-class problems).
For the statistically designed infant leukemia dataset, LDA was applied successfully
to the AML/ALL and t(4;11)/NOT class distinctions. Fisher's linear discriminant
analysis was further used to identify three well-separated classes that clustered within
the seven nominal MLL subclasses for which karyotype labels were available.
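
As a rough illustration of a two-class linear discriminant rule of the kind discussed above, the following Python sketch fits a discriminant function on two feature variables using class means and a pooled covariance estimate. The function names, the toy data, and the equal-prior assumption are illustrative only and are not taken from the analysis described herein.

```python
def mean_vec(rows):
    """Component-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_cov(class0, class1, m0, m1):
    """Pooled within-class covariance matrix (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((class0, m0), (class1, m1)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    dof = len(class0) + len(class1) - 2
    return [[s[a][b] / dof for b in range(2)] for a in range(2)]

def fit_lda(class0, class1):
    """Return (w, b) so that w.x + b > 0 assigns x to class 1 (equal priors assumed)."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s = pooled_cov(class0, class1, m0, m1)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b

def classify(w, b, x):
    """Evaluate the discriminant function and return the predicted class (0 or 1)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Hypothetical training samples: two features per sample, two classes.
class0 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)]
class1 = [(5.0, 6.0), (6.0, 5.0), (5.5, 6.5), (6.5, 5.5)]
w, b = fit_lda(class0, class1)
```

A new sample is then assigned to a region of feature space by the sign of the discriminant function, for example `classify(w, b, (6.0, 6.0))`.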

For both classes of methods, a major issue is the question of feature selection,
either as an independent step prior to classification, or as part of the classifier training
step. In addition to a simple ranking based on t-test score as used by other researchers
(Dudoit et al., J. Am. Stat. Ass'n. 97(457):77-87, 2002), the use of stepwise
discriminant analysis for determining optimal sets of distinguishing genes has been
investigated. One challenge in the stepwise approach is the rapid increase of
computational burden with the number of genes included in the initial set; the method
is therefore being implemented on large-scale parallel computers. An alternative gene
selection approach that is presently being explored is stepwise logistic regression
(McCulloch et al., Generalized, Linear, and Mixed Models, Wiley, New York, 2001;
SAS Online Documentation for SAS System, Release 8.02, SAS Institute, Inc. 2001).
Logistic regression is known to be well suited to binary classification problems
involving mixed categorical and continuous data or to cases where the data are not
normally distributed within the respective classes.
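
The simple t-test ranking mentioned above can be sketched as follows. This is a minimal illustration that assumes a Welch (unequal variance) two-sample t-statistic and a binary class label; the gene names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t-statistic for one gene's expression in two classes."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expr, labels):
    """Rank genes by |t|, largest first.

    expr:   dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels, in the same sample order
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: six samples, three genes.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "gene_a": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],
    "gene_b": [5.0, 6.0, 4.0, 5.5, 4.5, 6.5],
    "gene_c": [7.0, 8.0, 9.0, 5.0, 6.0, 7.0],
}
ranking = rank_genes(expr, labels)
```

The top-ranked genes would then be passed to the classifier; stepwise methods differ in that they re-evaluate candidate genes conditional on those already selected.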

Various extensions of these techniques are expected to enable the
incorporation of both categorical and continuous data in our classifiers. This would enable
the inclusion of known, discrete clinical labels (age, sex, genotype, white blood count,
etc.) in conjunction with microarray expression vectors, in order to perform more
accurate classifications, particularly for outcome prediction. In addition to logistic
regression as mentioned previously, one approach is to first quantify the categorical
data (Hayashi, Ann. Inst. Statist. Math. 3:69-98, 1952), and then apply standard
non-parametric statistical classification techniques in the usual manner.

4. Fuzzy Inference

Traditional classification methods are based on the theory of crisp sets, where
an element is either a member of a particular set or not. However, many objects
encountered in the real world do not fall into precisely defined membership criteria.
Fuzzy inference (also known as fuzzy logic) and adaptive neuro-fuzzy models
are powerful learning methods for pattern recognition. Although researchers have
previously investigated the use of fuzzy logic methods for reconstructing triplet
relationships (activator/repressor/target) in gene regulatory networks (Woolf et al.,
Physiol. Genomics 3:9-15, 2000), these techniques have not been previously applied
to the genomic classification problem. A significant advantage of fuzzy models is
their ability to deal with problems where set membership is not binary (yes/no);
rather, an element can reside in more than one set to varying degrees. For the
classification problem, this results in a model that, like probabilistic methods such
as Bayesian nets, can accommodate data sources that are incomplete, noisy, and
may ultimately include non-numeric text-based expert knowledge derived from
clinical data; polymorphisms or other forms of genomic data; or proteomic data that
must be incorporated into the overall model in order to achieve a more accurate
classification system in clinical contexts such as outcome prediction.
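
The idea that an element can reside in more than one set to varying degrees can be illustrated with a short Python sketch. The triangular membership functions, the set names, and the normalized 0-to-1 expression scale are assumptions chosen for illustration and do not describe a specific model used herein.

```python
def triangular(x, left, peak, right):
    """Degree of membership in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy sets over a normalized expression scale of 0 to 1,
# each given as (left foot, peak, right foot).
FUZZY_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

def memberships(x):
    """Return the degree to which x belongs to each fuzzy set."""
    return {name: triangular(x, *abc) for name, abc in FUZZY_SETS.items()}

# A normalized expression value of 0.4 is partly "low" and mostly "medium".
m = memberships(0.4)
```

In contrast, a crisp rule would force the same value into exactly one of the three sets.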

5. Genetic Algorithms

Fuzzy logic and other classification methods require the use of a gene
selection method in order to reduce the size of the feature space to a numerically
tractable size, and identify optimal sets of class-distinguishing genes for further
analysis. We are exploring the use of genetic algorithms (GAs) for determining
optimal feature sets during the training phase of a classification problem.

A GA is a simulation method that makes it possible to robustly search a very
large space of possible solutions to an optimization problem, and find candidate
solutions that are near optimal. Unlike traditional analytic approaches, GAs tend to avoid
"local minimum" traps, a classic problem arising in high-dimensional search spaces.
Optimal feature selection for gene expression data where the sample size N is much
smaller than the number of features d (for the Affymetrix leukemia data analyzed,
d ≈ 12,000 and N ≈ 100-200) is a classic problem of this type. A genetic algorithm
code has been developed by us to perform feature selection for the K-nearest
neighbors classification method using the recently proposed GA/KNN approach (Li
et al., Bioinformatics 17:1131-42, 2001); this method, which is compute-intensive,
has been implemented on parallel supercomputers. The approach has been
applied recently to the statistically designed infant leukemia dataset, to evaluate
biologic clusters discovered using unsupervised learning (VxInsight). The
GA/KNN method was able to predict the hypothesized cluster labels (A, B, C) in
one-vs.-all classification experiments.
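
To make the search procedure concrete, the following Python sketch shows a small genetic search for a gene subset scored by leave-one-out K-nearest-neighbor accuracy. It is a simplified, mutation-only illustration and not the code referred to above; the population size, number of generations, selection rule, and toy data are all assumptions.

```python
import random

def knn_loo_accuracy(X, y, genes, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbor vote restricted to `genes`."""
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance from sample i to every other sample.
        dists = sorted(
            (sum((X[i][g] - X[j][g]) ** 2 for g in genes), y[j])
            for j in range(len(X)) if j != i
        )
        votes = [label for _, label in dists[:k]]
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == y[i]
    return correct / len(X)

def ga_select(X, y, subset_size, pop_size=20, generations=15, seed=0):
    """Mutation-only genetic search for a gene subset with high KNN accuracy."""
    rng = random.Random(seed)
    n_genes = len(X[0])

    def fitness(chromosome):
        return knn_loo_accuracy(X, y, chromosome)

    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population unchanged.
        survivors = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            child = list(rng.choice(survivors))
            # Mutation: replace one gene with a gene not already in the subset.
            unused = [g for g in range(n_genes) if g not in child]
            child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    best = max(population, key=fitness)
    return sorted(best), fitness(best)

# Toy data (hypothetical): 12 samples, 8 genes; only gene 0 tracks the class label.
y = [0] * 6 + [1] * 6
X = [[10.0 * y[i]] + [float((7 * i + 3 * g) % 5) for g in range(1, 8)] for i in range(12)]
best_genes, accuracy = ga_select(X, y, subset_size=2)
```

With roughly 12,000 candidate genes the fitness evaluations dominate the cost, which is consistent with the need for parallel hardware noted above.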



EXAMPLE II. 

Identification of a Gene Strongly Predictive of Outcome in Pediatric Acute
Lymphoblastic Leukemia (ALL): OPAL1



Summary 

To identify genes strongly predictive of outcome in pediatric ALL, we analyzed
the retrospective case control study of 254 pediatric ALL samples described in
Example IA. We divided the retrospective POG ALL case control cohort (n=254)
into training (2/3 of cases, the "preB training set") and test (1/3 of cases, the "preB test
set") sets, applied a Bayesian network approach, and performed statistical analyses. A
gene particularly predictive of outcome in pediatric ALL was identified,
corresponding to Affymetrix probe set 38652_at ("G0": Hs.10346; NM_Hypothetical
Protein FLJ20154; partial sequences reported in GenBank Accession Numbers
NM_017787; NM_017690; XM_053688; NP_060257). Two other genes, Affymetrix
probe set 34610_at ("G1": GNB2L1: G protein β2, related sequence 1; GenBank
Accession Number NM_006098); and Affymetrix probe set 35659_at ("G2": IL-10
Receptor alpha; GenBank Accession Number U00672), were identified as associated
with outcome in conjunction with OPAL1/G0, but were substantially less significant.
OPAL1/G0, which we have named OPAL1 for outcome predictor in acute leukemia,
was a heretofore unknown human expressed sequence tag (EST), and had not been
fully cloned until now. G1 (G protein β2, related sequence 1) encodes a novel
RACK (receptor of activated protein kinase C) protein and is involved in signal
transduction (Wang et al., Mol Biol Rep. 2003 Mar;30(1):53-60) and G2 is the
well-known IL-10 receptor alpha.

Importantly, we found that OPAL1/G0 was highly predictive of outcome
(p=.0014) in a completely different set of ALL cases assessed by gene expression
profiling by another laboratory (the St. Jude set of ALL cases previously published by
Yeoh et al. (Cancer Cell 1:133-143, 2002)). We also observed a trend between high
OPAL1/G0 and improved outcome in our retrospective cohort of infant ALL cases.
We have fully cloned the human homologue of OPAL1/G0 and characterized its
genomic structure. OPAL1/G0 is highly conserved among eukaryotes, maps to
human chromosome 10q24, and appears to be a novel transmembrane signaling
protein with a short membrane insertion sequence and a potential transmembrane
domain. This protein may be inserted into the extracellular membrane (and
function like a signaling receptor) or within an intracellular domain. We have also
developed specific automated quantitative real time RT-PCR assays to precisely
monitor the expression of OPAL1/G0 and other genes that we have found to be
associated with outcome in ALL.

Bayesian networks 

We used Bayesian networks, a supervised learning algorithm as described in 
Example IB, to identify one or more genes that could be used to predict outcome as 
well as therapeutic resistance and treatment failure. To identify genes strongly 
predictive of outcome in pediatric ALL, we divided the retrospective POG ALL case 
control cohort (n=254) described above into training (2/3 of cases) and test (1/3 of 
cases) sets. Computational scientists were blinded to all clinical and biologic
co-variables during training, except those necessary for the computational tasks. A large
number of computational experiments were performed, in order to properly sample 
the space of Bayesian nets satisfying the constraints of the problem. In the context of 
high-dimensional gene expression data, the inclusion of more nets than is typical in 
the literature appears to yield better results. Our initial results using Bayesian nets 
showed classification rates in excess of 90-95%. 

Identification of genes associated with outcome 

A particularly strong set of genes predictive of outcome was identified by 
applying a Bayesian network analysis to the preB training set. The three genes in the 
strongest predictive tree identified by Bayesian networks are provided in Table 2. 

Table 2: Genes Strongly Predictive of Outcome in Pediatric ALL

Gene Identifier       Affymetrix Oligo    Gene/Protein Name                   Previously Known Function /
(Bayesian Network)    Sequence                                                Comment
------------------    ----------------    --------------------------------    ---------------------------------
G0                    38652_at            Hs.10346; NM_Hypothetical           Unknown human EST, not previously
                                          Protein FLJ20154                    fully cloned.
G1                    34610_at            GNB2L1: G protein β2,               Signal Transduction; Activator of
                                          related sequence 1                  Protein Kinase C
G2                    35659_at            IL-10 Receptor alpha                IL-10 Receptor alpha

Fig. 4 shows a graphic representation of statistics that were extracted from the
Bayesian net (Bayesian tree) that show association with outcome in ALL. The circles
represent the key genes; the lighter arrows pointing toward the left denote low
expression levels while the darker arrows pointing toward the right denote high
expression of each gene. The percentage of patients achieving remission (R) or
therapeutic failure (F) is shown for high or low expression of each gene, along with
the number of patients in each group in parentheses.

Our analysis showed that pediatric ALL patients whose leukemic cells contain
relatively high levels of expression of OPAL1/G0 have an extremely good outcome
while low levels of expression of OPAL1/G0 are associated with treatment failure. At
the top of the Bayesian network, OPAL1/G0 conferred the strongest predictive power;
by assessing the level of OPAL1/G0 expression alone, ALL cases could be split into
those with good outcomes (OPAL1/G0 high: 87% long term remissions) versus those
with poor outcomes (OPAL1/G0 low: 32% long term remissions, 68% treatment
failure). Detailed statistical analyses of the significance of OPAL1/G0 expression in
the retrospective cohort revealed that low OPAL1/G0 expression was associated with
induction failure (p=.0036) while high OPAL1/G0 expression was associated with
long term event free survival (p=.02), particularly in males (p=.0004). Higher levels
of OPAL1/G0 expression were also associated with certain cytogenetic abnormalities
(such as t(12;21)) and normal cytogenetics. Although the number of cases was
limited in our initial retrospective cohort, low levels of OPAL1/G0 appeared to define
those patients with low risk ALL who failed to achieve long term remission,
suggesting that OPAL1/G0 may be useful in prospectively identifying children who
would otherwise be classified as having low or standard risk disease, but who would
benefit from further intensification.

The pre-B test set (containing the remaining 87 members of the pre-B cohort)
was also analyzed. Unexpectedly, OPAL1/G0 when evaluated on the pre-B test set
showed a far less significant correlation with outcome. This is the only one of the
four data sets (infant, pre-B training set, pre-B test set, and the Downing data set,
below) in which no correlation was observed. One possible explanation is that,
despite the fact that the preB data set was split into training and test sets by what
should have been a random process, in retrospect, the composition of the test set
differed very significantly from the training set. For example, the test set contains a
disproportionately high fraction of studies involving high risk patients with poorer
prognosis cytogenetic abnormalities which lack OPAL1/G0 expression; these children
were also treated on highly different treatment regimens than the patients in the
training set. Thus, there may not have been enough leukemia cases that expressed
higher OPAL1/G0 levels (there were only sixteen patients with a high OPAL1/G0
expression value in the test set) for us to reach statistical significance. Finally, the
p-value observed for the preB training set was so strong, as was the validation p-value
for OPAL1/G0 outcome prediction in the independent data sets, that it would be
virtually impossible that the observed correlation between OPAL1/G0 and outcome is
an artifact.

In addition, PCR experiments recently completed in accordance with the
methods outlined in Example III support the importance of OPAL1/G0 as a predictor
of outcome. Although a large fraction (30%) of the 253 pre-B cases could not be
assessed by PCR due to sample availability, including 8 of the 36 cases from the
pre-B training set in which OPAL1/G0 was highly expressed, an initial analysis of the
results on the 174 cases which could be assessed supports a clear statistical correlation
between OPAL1/G0 and outcome (a p-value of about 0.005 on the PCR data alone,
when the OPAL1/G0-high threshold is considered fixed). It should be noted that
these PCR samples cut across the pre-B training and test sets, and that the PCR
results do not seem to reflect the same dichotomy in training and test set correlation as
was seen in the microarray data. Furthermore, the RNA targets for the PCR assays
(directly amplified cDNA) and the Affymetrix array experiments (cDNA linearly
amplified twice) are quite different and it is satisfying that a moderately strong
correlation (r = 0.62) was observed between these two quite distinct methodologies to
quantitate gene expression. Additionally, in a random re-sampling (bootstrap)
procedure reported herein, OPAL1/G0 does exhibit consistent significance.

As noted above, we evaluated expression levels of OPAL1/G0 in three entirely
different and disjoint data sets. Two of the data sets, described above, were derived
from retrospective cohorts of pediatric ALL patients registered to clinical trials
previously coordinated by the Pediatric Oncology Group (POG): the statistically
designed cohort of 127 infant leukemias (the "infant" data set); and the statistically
designed case control study of 254 pediatric B-precursor and T cell ALL cases (the
"pre-B" data set), specifically the 167 member "pre-B" training set. The third data set
evaluated was a publicly available set of ALL cases previously published by Yeoh et
al. (the "Downing" or "St. Jude" data set) (Cancer Cell 1:133-143, 2002).

The following breakdown was conditioned on OPAL1/G0 expression level at
its optimal threshold value, which in all data sets examined fell near the top quarter
(22-25%) of the expression values. Low OPAL1/G0 expression was defined as
having normalized OPAL1/G0 expression below this value, while high OPAL1/G0
expression was defined as having normalized OPAL1/G0 expression equal to or
greater than this value.
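
The high/low definition above can be sketched in Python as follows. This is only an illustration of splitting samples at a fixed top fraction of expressors; the function name and the synthetic expression values are hypothetical, and the optimized threshold used for the pre-B training set is not reproduced here.

```python
def split_by_top_fraction(values, top_fraction=0.22):
    """Return (threshold, labels), labeling the top fraction of expressors 'high'.

    A sample is 'high' when its normalized expression is equal to or greater
    than the threshold, and 'low' otherwise. Tied values at the threshold are
    all labeled 'high'.
    """
    n_high = round(len(values) * top_fraction)
    ranked = sorted(values, reverse=True)
    threshold = ranked[n_high - 1]
    labels = ["high" if v >= threshold else "low" for v in values]
    return threshold, labels

# Synthetic example: 76 distinct expression values, top 22% called 'high'.
values = [float(v) for v in range(1, 77)]
threshold, labels = split_by_top_fraction(values)
n_high = labels.count("high")
```

For 76 samples a 22% cut yields 17 high expressors, and for 232 samples it yields 51, matching the group sizes reported for the validation data sets below.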

Of the 167 members of the pre-B training set, 73 (44%) were classified as
CCR (continuous complete remission) while 94 (56%) were classified as FAIL.
Relative to the optimized threshold value, OPAL1/G0 expression was determined to
be low in 131 samples and high in 36 samples. The following statistics were
observed.

Low OPAL1/G0 expression (131 samples):
CCR: 42 32%
FAIL: 89 68%

High OPAL1/G0 expression (36 samples):
CCR: 31 86%
FAIL: 5 14%

The following p-values were observed, each being the probability that a gene
uncorrelated with outcome possesses any threshold point yielding our observations
or better:

By Chi-squared: p-value 1.2 * 10^(-7) (approximately 1 in ten million)
By TNoM: p-value 5.7 * 10^(-7) (approximately 1 in two million)

where TNoM refers to the Threshold Number of Misclassifications, i.e., the number of
misclassifications made by using a single-gene classifier with an optimally chosen
threshold for separating the classes.
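
A minimal Python sketch of the TNoM score as defined above is given below. It assumes binary labels coded 0/1 and considers both directions of the single-gene rule; the function name and the toy values are illustrative, and the computation of the associated p-value is not shown.

```python
def tnom(values, labels):
    """Minimum number of misclassifications over all thresholds and both directions.

    values: expression values of one gene, one per sample
    labels: 0/1 class labels in the same sample order
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    # A threshold outside the data range predicts a single class for everyone.
    best = min(n_pos, n_neg)
    pos_below = 0
    for i in range(n - 1):
        pos_below += int(pairs[i][1] == 1)
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no threshold can separate two equal values
        neg_below = (i + 1) - pos_below
        # Rule A: predict class 1 above the threshold, class 0 below it.
        err_a = pos_below + (n_neg - neg_below)
        # Rule B: predict class 1 below the threshold, class 0 above it.
        err_b = neg_below + (n_pos - pos_below)
        best = min(best, err_a, err_b)
    return best

# Toy example: one sample is on the wrong side of the best threshold.
score = tnom([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 0, 1, 0, 1, 1])
```

A score of zero corresponds to a gene whose expression separates the two classes perfectly at some threshold.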

The significance of these p-values must be assessed in light of the fact that
12,000+ genes can be so considered (individually) against the training data. Even
with 1.25 * 10^4 candidate genes, under the null hypothesis of no associations, the
expected number of genes that possess a threshold yielding our observation (or better)
is still extremely small:

By Chi-squared: ( 1.2 * 10^(-7) ) * ( 1.25 * 10^4 ) = 1.5 * 10^(-3)
By TNoM: ( 5.7 * 10^(-7) ) * ( 1.25 * 10^4 ) = 7.5 * 10^(-3)

Hence, one would expect to have to search approximately 667 independent data sets,
each similar in composition to our pre-B training set (each consisting of 1.25 * 10^4
candidate genes and 167 cases), in order to find even a single gene in one of these 667
data sets possessing a threshold yielding our observations or better as measured by
Chi-squared, due to chance alone. (Using the p-value obtained from the TNoM
statistic, we would expect to have to search 133 similar, independent data sets to find
even a single gene possessing a threshold yielding a TNoM score at least as good as
our observation.) These p-values are highly significant and support the conclusion
that the observed statistical correlations are real, with high confidence.
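
The multiple-testing argument above reduces to two lines of arithmetic, sketched here in Python for the Chi-squared figure. The function names are illustrative, and the calculation assumes, as the text does, that the candidate genes are considered individually under the null hypothesis of no association.

```python
def expected_chance_hits(p_value, n_genes):
    """Expected number of genes reaching the observed result by chance alone."""
    return p_value * n_genes

def datasets_per_chance_hit(p_value, n_genes):
    """Approximate number of similar data sets searched before one chance hit is expected."""
    return 1.0 / expected_chance_hits(p_value, n_genes)

# Chi-squared p-value and candidate gene count as reported above.
hits = expected_chance_hits(1.2e-7, 1.25e4)        # about 1.5e-3
n_sets = datasets_per_chance_hit(1.2e-7, 1.25e4)   # about 667
```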

Our analysis of the pre-B training set showed that pediatric ALL patients
whose leukemic cells contain relatively high levels of expression of OPAL1/G0 have
an extremely good outcome while low levels of expression of OPAL1/G0 are
associated with treatment failure. In the entire pediatric ALL cohort under analysis,
44% of the patients were in long term remission for 4 or more years, while 56% of the
patients had failed therapy within 4 years. At the top of the Bayesian network,
OPAL1/G0 conferred the strongest predictive power; by assessing the level of
OPAL1/G0 expression alone, ALL cases could be split into those with good outcomes
(OPAL1/G0 high: 87% long term remission; 13% failures) versus those with poor
outcomes (OPAL1/G0 low: 32% long term remissions, 68% treatment failure).
Although the numbers are quite small as we continue down the Bayesian tree,
outcome predictions can be somewhat refined by analyzing the expression levels of
G1 and G2.

We also investigated OPAL1/G0 expression level statistics across biological
classifications typically utilized as predictive of outcome. The following represents a
breakdown of OPAL1/G0 expression statistics within various subpopulations of the
pre-B training set. The OPAL1/G0 threshold obtained by optimization in the original
pre-B training set analysis (a value of 795) was used.

Normal Genotype (65 members)

Outcome statistics
26 CCR 40%
39 FAIL 60%

Low OPAL1/G0 expression (51 samples)
13 CCR 25%
38 FAIL 75%

High OPAL1/G0 expression (14 samples)
13 CCR 93%
1 FAIL 7%

t(12;21) (equivalent to TEL/AML1 in Downing data set, below) (24 members)

Outcome statistics
18 CCR 75%
6 FAIL 25%

Low OPAL1/G0 expression (bottom 78%; 10 samples)
6 CCR 60%
4 FAIL 40%

High OPAL1/G0 expression (top 22%; 14 samples)
12 CCR 86%
2 FAIL 14%

Hyperdiploid (17 members)

Outcome statistics
9 CCR 53%
8 FAIL 47%

Low OPAL1/G0 expression (13 samples)
5 CCR 38%
8 FAIL 62%

High OPAL1/G0 expression (4 samples)
4 CCR 100%
0 FAIL 0%



t(4;11) and t(1;19) combined (35 members)

Outcome statistics
13 CCR 37%
22 FAIL 63%

Low OPAL1/G0 expression (34 samples)
13 CCR 38%
21 FAIL 62%

High OPAL1/G0 expression (1 sample)
0 CCR 0%
1 FAIL 100%

t(9;22) and hypodiploid combined (12 members)

Outcome statistics
2 CCR 17%
10 FAIL 83%

Low OPAL1/G0 expression (12 samples)
2 CCR 17%
10 FAIL 83%

High OPAL1/G0 expression (0 samples)
0 CCR --
0 FAIL --

Low Age ( <= 10 years ) (109 members)

Outcome statistics
55 CCR 50%
54 FAIL 50%

Low OPAL1/G0 expression (80 samples)
30 CCR 38%
50 FAIL 62%

High OPAL1/G0 expression (29 samples)
25 CCR 86%
4 FAIL 14%

High Age ( > 10 years ) (58 members)

Outcome statistics
18 CCR 31%
40 FAIL 69%

Low OPAL1/G0 expression (51 samples)
12 CCR 24%
39 FAIL 76%

High OPAL1/G0 expression (7 samples)
6 CCR 86%
1 FAIL 14%

Low WBC ( <= 50,000 ) (79 members)

Outcome statistics
39 CCR 49%
40 FAIL 51%

Low OPAL1/G0 expression (58 samples)
21 CCR 36%
37 FAIL 64%

High OPAL1/G0 expression (21 samples)
18 CCR 86%
3 FAIL 14%

High WBC ( > 50,000 ) (88 members)

Outcome statistics
34 CCR 39%
54 FAIL 61%

Low OPAL1/G0 expression (73 samples)
21 CCR 29%
52 FAIL 71%

High OPAL1/G0 expression (15 samples)
13 CCR 87%
2 FAIL 13%

The data evidence a number of interesting interactions between OPAL1/G0
and various parameters used for risk classification (karyotype and NCI risk criteria).
Age and WBC (White Blood Count), in particular, are routinely used in the current
risk stratification standards (age > 10 years or WBC > 50,000 are high risk), yet
OPAL1/G0 appears to be the dominant predictor within both of these groups. Indeed,
OPAL1/G0 appears to "trump" outcome prediction based on these biological
classifications. In other words, regardless of biological classification, roughly the
same OPAL1/G0 statistics are observed. For example, even though the
translocation t(12;21) is generally associated with very good outcome, when
OPAL1/G0 is low, the t(12;21) outcome is not nearly as good as when OPAL1/G0 is
high. This association is also present in the Downing data set (see below), according
to our analysis, although it was not recognized by Yeoh et al.

In our retrospective cohort balanced for remission/failure, OPAL1/G0 was
more frequently expressed at higher levels in ALL cases with normal karyotype
(14/65, 22%), t(12;21) (14/24, 58%) and hyperdiploidy (4/17, 24%) compared to
cases with t(1;19) (2%) and t(9;22) (0%). 86% of ALL cases with t(12;21) and high
OPAL1/G0 achieved long term remission, while t(12;21) with low OPAL1/G0 had
only a 40% remission rate. Interestingly, 100% of hyperdiploid cases and 93% of
normal karyotype cases with high OPAL1/G0 attained remission, in contrast to an
overall remission rate of 40% in each of these genetic groups.

Although our case numbers were small and the cases highly selected, there
appeared to be a correlation between low OPAL1/G0 and failure to achieve remission
in children with low risk disease, suggesting that OPAL1/G0 may be useful in
prospectively identifying children with low or standard risk disease who would
benefit from further intensification. Interestingly, among children in the standard NCI risk
group (age < 10; WBC < 50,000), with an overall remission rate of 50% in this case
control study, those with high OPAL1/G0 had an 86% long term remission rate.
Even among children with NCI high risk criteria (age > 10, WBC > 50,000), with an overall
remission rate of 31% in this selected cohort, those with high OPAL1/G0 had an
87% remission rate. Finally, OPAL1/G0 was also highly predictive of outcome in T
ALL (p=.02), as well as B precursor ALL.

Our statistical analyses of the significance of OPAL1/G0 expression in the
retrospective cohort revealed that low OPAL1/G0 expression was associated with
induction failure (p=.0036) while high OPAL1/G0 expression was associated with
long term event free survival (p=.02), particularly in males (p=.0004). Interestingly,
actual quantitative levels of OPAL1/G0 appeared to be important and there was a
clear expression threshold between remission and relapse.

To further validate the role of OPAL1/G0 in outcome prediction in ALL, we
tested the usefulness of OPAL1/G0 on two additional independent sets of ALL cases,
the statistically designed infant ALL cohort described above, and the publicly
available St. Jude ALL dataset (Yeoh et al., Cancer Cell 1:133-143, 2002). In these
two data sets, it should be noted that we explored OPAL1/G0's statistics specifically,
and (in this context) did not test any other gene. Hence, the significance of the
p-values computed for these two additional data sets should not be balanced against a
large number of potential candidate genes. There was only one gene considered, and
that was OPAL1/G0. Further, the threshold was fixed at the top 22% of
expressors (17 samples), not optimized as it was in the analysis of the pre-B
training set.

Of the 76 members of the infant ALL data set (restricted to non-marginal
ALLs), 29 (38%) were classified as CCR (continuous complete remission) while 47
(62%) were classified as FAIL. The following statistics were observed.

Low OPAL1/G0 expression (bottom 78%; 59 samples)
CCR: 19 32%
FAIL: 40 68%

High OPAL1/G0 expression (top 22%; 17 samples)
CCR: 10 59%
FAIL: 7 41%

By Chi-squared: p-value ~= 0.0465
By TNoM: p-value 0.0453
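
For a fixed threshold, the Chi-squared p-value of a 2 x 2 outcome table can be checked with a few lines of Python. The sketch below assumes a Pearson Chi-squared statistic without continuity correction and one degree of freedom; the function name is illustrative, and the counts are those of the infant data set above.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared statistic and p-value (1 degree of freedom)
    for the 2 x 2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the Chi-squared distribution with 1 degree of freedom.
    p = erfc(sqrt(stat / 2.0))
    return stat, p

# Rows: low and high expression; columns: CCR and FAIL (infant data set).
stat, p = chi2_2x2(19, 40, 10, 7)
```

Under these assumptions the infant counts give a statistic of about 3.96 and a p-value of about 0.0465, in agreement with the value reported above.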

For the Downing data set, "Heme Relapse" and "Other Relapse" were
classified as FAIL and the 2nd AML was discarded as being of indeterminate
outcome. Of the 232 members of the Downing data set, 201 (87%) were classified as
CCR (continuous complete remission) while 31 (13%) were classified as FAIL. The
following statistics were observed.

Low OPAL1/G0 expression (bottom 78%; 181 samples)
CCR: 150 83%
FAIL: 31 17%

High OPAL1/G0 expression (top 22%; 51 samples)
CCR: 51 100%
FAIL: 0 0%

By Chi-squared: p-value 0.0014
TNoM is NA because same majority class in both groups

An additional result against the Downing data set is that if the threshold is lowered
slightly to include in the high group the top 25% of expressors (that is, 8 additional
cases are above the OPAL1/G0 threshold), we obtained:

Low OPAL1/G0 expression (bottom 75%; 173 samples)
CCR: 142 82%
FAIL: 31 18%

High OPAL1/G0 expression (top 25%; 59 samples)
CCR: 59 100%
FAIL: 0 0%

By Chi-squared: p-value 0.0004
TNoM is NA because same majority class in both groups

The more reflective p-value apparently lies closer to p = 0.0004 than to 0.0014, since
the threshold point is only a small distance from the predetermined 22% point and is
characterized by a large gap in OPAL1/G0 expression values.

It should be noted that all three of these data sets are totally disjoint, and as a
result the latter two studies represent independent validation of the statistics observed
in the original "pre-B" training set evaluation. As previously discussed, Yeoh et al.
were not able to identify or validate genes associated with outcome in the St. Jude
dataset. The St. Jude data set was not balanced for remission versus failure; the
overall long term remission rate in this series of cases was 87%. Additionally, Yeoh
et al. employed SVMs which included many genes in the classification that masked
the significance of OPAL1/G0. Our adapted BD metric controlled model complexity
and allowed the significance of OPAL1/G0 to be realized in this data set. Indeed, we
found that 100% of the cases in this St. Jude series with higher levels of OPAL1/G0,
regardless of karyotype, achieved long term remissions (p=.0014).

The following represents a breakdown of OPAL1/G0 expression statistics
within various subpopulations of the Downing data set. The OPAL1/G0 threshold
(25%) obtained by optimization in the original pre-B training set analysis was used.
This yields 59 high OPAL1/G0 cases in total, which are distributed among the various
subgroups as follows:

TEL-AML1 (61 members)

Outcome statistics
57 CCR 93%
4 FAIL 7%

Low OPAL1/G0 expression (7 samples)
3 CCR 43%
4 FAIL 57%

High OPAL1/G0 expression (54 samples)
54 CCR 100%
0 FAIL 0%

Hyperdiploid > 50 (48 members)

Outcome statistics
43 CCR 90%
5 FAIL 10%

Low OPAL1/G0 expression (46 samples)
41 CCR 89%
5 FAIL 11%

High OPAL1/G0 expression (2 samples)
2 CCR 100%
0 FAIL 0%

Hyperdiploid 47-50 (19 members)

Outcome statistics
19 CCR 100%
0 FAIL 0%

Low OPAL1/G0 expression (18 samples)
18 CCR 100%
0 FAIL 0%

High OPAL1/G0 expression (1 sample)
1 CCR 100%
0 FAIL 0%

Pseudodiploid (21 members)

Outcome statistics
19 CCR 90%
2 FAIL 10%

Low OPAL1/G0 expression (19 samples)
17 CCR 89%
2 FAIL 11%

High OPAL1/G0 expression (2 samples)
2 CCR 100%
0 FAIL 0%

These data support the association of OPAL1/G0 with outcome across
biological classifications, as noted above for the pre-B training set.

Cloning and Characterization of OPAL1/G0

The human homologue of OPAL1/G0 was fully cloned and its genomic
structure characterized. OPAL1/G0 is highly conserved among eukaryotes, maps to
human chromosome 10q24, and appears to be a novel, potentially transmembrane
signaling protein. To clone OPAL1/G0, RACE PCR was used to clone upstream
sequences in the cDNA using lymphoid cell line RNAs. The genomic structure was
derived from a comparison of OPAL1/G0 cDNAs to contiguous clones of germline
DNA in GenBank. The total predicted mRNA length is approximately 4 kb (Fig. 2C;
SEQ ID NO:16). We have developed very specific primers and probes to measure
OPAL1/G0 (as well as G1 and G2) (see Example III) both qualitatively and
quantitatively using PCR techniques.

Interestingly, preliminary studies reveal that the gene for OPAL1/G0 encodes
two different RNAs (and potentially up to five different RNAs through alternative
splicing of upstream exons) and presumably two different proteins based on
alternative use of 5' exons (1a and 1). These two different transcripts are differentially
expressed in leukemia cell lines.

Fig. 5 is a schematic drawing of the structure of OPAL1/G0. OPAL1/G0 is
encoded by four different exons and was cloned using RACE PCR from the 3' end of
the gene using the Affymetrix oligonucleotide probe sequence (38652_at);
interestingly, the oligonucleotide (overlining labeled "Affy probes") designed by
Affymetrix from EST sequences turns out to be in the extreme 3' untranslated region
of this novel gene. The predicted coding region is shown as underlining for each
exon. The locations of primers we developed for use in quantitative detection of
transcripts are shown as arrows above the exons.

Interestingly, OPAL 1 /GO appears to encode at least two different proteins 
10 through alternative splicing of different 5' exons (1 and la). Fig. 2 A shows the 
nucleotide sequence (SEQ ID NO:l) and putative amino acid sequence (SEQ ID 
NO:2) of OPAL 1 /GO. (including exon 1), and Fig. 2B shows the nucleotide sequence 
(SEQ ID NO:3) and putative amino acid sequence (SEQ ID NO:4) of OPALl/GO 
(including exon la). 

1 5 Table 3 shows the results of RT-PCR assays performed in accordance with 

Example III that confirm alternative exon use in OPALl/GO. While all leukemia cell 
lines (REH, SUPB15) contained an OPALl/GO transcript with exons 2-3 and with 
exon la fused to exon 2; only V2 of the cell lines and the primary human ALL samples 
isolated to date express the alternative transcript (exon 1 fused to exon 2). 

Table 3. RT-PCR assays of alternative exon use in OPAL1/GO.

                                       GO
Cell line               exon 1-2    exon 1a-2    exon 2-3
SUPB15 t(9;22) e1a2                     +            +
REH t(12;21)               +            +            +
K562 t(9;22) b3a2          +            +            +
BV173 t(9;22) b2a2                      +            +
697 t(1;19)                             +            +
NB-4 t(15;17)                           +            +
MV411 t(4;11)                           +            +
size                      154          158          166
predicted                 148          155         ~168

100 ng equivalent RNA into each reaction

OPAL1/GO appears to be rather ubiquitously expressed, and it has a highly
similar murine homologue. Preliminary examination of the translated coding
sequence (Fig. 2) reveals a novel protein with a signal peptide; a short
sequence (53 amino acids) which may either be inserted in the plasma membrane
and be extracellular, or be inserted within an intracellular membrane; a
potential transmembrane domain; and an intracellular domain. Within the
intracellular domain there are proline-rich regions that have strong
homologies to proteins that bind WW domains and which are referred to as
WW-binding protein 1 (WBP, see above). WW domains mediate interactions between
proline-rich transcription factors and cytoplasmic signaling molecules. The
data suggest that this novel gene encodes a signaling protein, which may
function as a receptor depending on its cellular location.

Characterization of G1 and G2

G1 encodes an interesting protein, a G protein β2 homologue that has been
linked to activation of protein kinase C, to inhibition of invasion, and to
chemosensitivity in solid tumors. It is also interesting that the Bayesian
tree linked G2 (the IL-10 receptor α) to G1 and OPAL1/GO, as the interleukin
IL-10 has been previously linked to improved outcome in pediatric ALL (Lauten
et al., Leukemia 16:1437-1442, 2002; Wu et al., Blood Abstract, Blood
Supplement 2002 (Abstract #3017)). IL-10 has been shown to be an autocrine
factor for B cell proliferation and also to suppress T cell immune responses.
ALL blasts that express a shortened, alternatively spliced form of IL-10 have
been shown to have significantly better 5-year EFS (p=.01) (Wu et al., Blood
Abstract, Blood Supplement 2002 (Abstract #3017)). We have developed specific
primers and probes to assess the direct expression of each of these genes in
large ALL cohorts (Example III).

EXAMPLE III.

RT-PCR for Analysis of Expression Levels of OPAL1/GO, G1, G2 and other Genes
of Interest

We have developed direct RT-PCR assays to precisely measure the quantitative
expression of these genes in an efficient two-step approach. First, we perform
a "qualitative" screen for positive cases using non-quantitative "end-point"
RT-PCR assays with rapid and very inexpensive detection using the Agilent
bioanalyzer. Positive cases detected with this simple, rapid, and highly
sensitive methodology are then targeted for precise quantitative assessment of
a particular gene using automated quantitative real time RT-PCR (Taqman
technology).

Sequences for OPAL1/GO (both splice forms) and pseudogenes identified from the
other chromosomes were aligned, and OPAL1/GO primers were designed to maximize
the differences between the true OPAL1/GO genes and the pseudogenes. The
primer and probe sequences developed for specific quantitative assessment of
the two alternatively spliced forms of OPAL1/GO (assessed by quantifying mRNAs
with exon 1 fused to exon 2 or, alternatively, exon 1a fused to exon 2) are:

For exon 1 or 1a to 2 (the (+) primers are sense and the (-) are antisense):
Exon 1(+)
CCAACGTTAGTGTGGACGATGC (SEQ ID NO:5)
Exon 1a(+)
GCATGGCGCTCCTGCTC (SEQ ID NO:6)
Exon 2(-)
GTAGTAGTTGCAGCACTGAGACTG (SEQ ID NO:7)
Exon 2 probe (5' FAM/3' TAMRA)
CCACAGCAGTGTCCTGTGTCACAGATGTAGC (SEQ ID NO:8)

For exon 2 to 3:
Exon 2(+)
CAGTCTCAGTGCTGCAACTACTAC (SEQ ID NO:9)
Exon 3(-)
GGCTTCTCGGTAAGCGATCAG (SEQ ID NO:10)
Exon 3 probe (5' FAM/3' TAMRA)
CTCAGGATGATGATGATGGTCCACACCAGCC (SEQ ID NO:11)
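Because the (+) primers are sense and the (-) primers antisense, one quick consistency check on the listed sequences is to compare reverse complements. The sketch below, with an illustrative helper name, shows that the Exon 2(-) primer (SEQ ID NO:7) is the reverse complement of the Exon 2(+) primer (SEQ ID NO:9), which suggests that the two assays use the same site in exon 2 on opposite strands.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Primer sequences copied from the list above.
EXON2_ANTISENSE = "GTAGTAGTTGCAGCACTGAGACTG"  # Exon 2(-), SEQ ID NO:7
EXON2_SENSE = "CAGTCTCAGTGCTGCAACTACTAC"      # Exon 2(+), SEQ ID NO:9

same_site = reverse_complement(EXON2_ANTISENSE) == EXON2_SENSE
print(same_site)
```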

Using these primers and probes, we have developed highly sensitive and
specific automated quantitative assays for OPAL1/GO expression over a wide
expression range. A standard curve was derived for the automated quantitative
RT-PCR assays for the two alternatively spliced forms of OPAL1/GO. The assays
were performed in the cell lines shown in Table 3 and are highly linear over a
large dynamic range.

The primer and probe sequences developed for specific quantitative assessment
of G1 (G protein β2) and G2 (IL10Rα) are:

G1: spans 2 introns (1.9 kb and 0.3 kb); from exon 3 to exon 5; 278 bp amplicon
G1e3(+)
CCAAGGATGTGCTGAGTGTGG (SEQ ID NO:12)
G1e5(-)
CGTGTTCAGATAGCCTGTGTGG (SEQ ID NO:13)

G2: spans 1 intron of 3.6 kb; from exon 3 to exon 4; 189 bp amplicon
G2e3(+)
CCAACTGGACCGTCACCAAC (SEQ ID NO:14)
G2e4(-)
GAATGGCAATCTCATACTCTCGG (SEQ ID NO:15)

Automated Quantitative RT-PCR 

We routinely develop fluorogenic RT-PCR assays to detect the presence of
leukemia-associated human genes, as well as viral genes, using an automated,
closed analysis system (ABI 7700 Sequence Detector, PE-Applied Biosystems
Inc., Foster City, CA). Accurate standards of cloned cDNAs containing the gene
or sequence of interest are prepared in plasmid vectors (pCR 2.1, Invitrogen).
These standard reagents are quantitated by fluorescence spectrometry and
serially diluted over a six log range. Quantitative PCR is carried out in
triplicate in the ABI 7700 instrument in a 96 well plate format, with
optimized PCR conditions for each assay. The reverse transcriptase reaction
employs 1 μg of RNA in a 20 μl volume consisting of 1X Perkin Elmer Buffer II,
7.5 mM MgCl2, 5 μM random hexamers, 1 mM dNTP, 40 U RNasin and 100 U MMLV
reverse transcriptase. The reaction is performed at 25°C for 10 minutes, 48°C
for 60 minutes and 95°C for 10 minutes. 4.5 μl of the resulting cDNA is used
as template for the PCR. This is added to 1X Taqman Universal PCR Master Mix
(PE Applied Biosystems, Foster City, CA), 100 nM fluorescently labeled Taqman
probe and 100 nM of each primer in a 50 μl volume. The PCR is performed in the
PRISM 7700 Sequence Detector as follows: "hot start" for 10 minutes at 95°C
(with AmpliTaq Gold, Perkin-Elmer), then 40 two-step cycles of 95°C for 15
seconds and 60°C for 1 minute. This system detects the level of fluorescence
from cleaved probe during each cycle of PCR and constructs the data into an
amplification plot. This displays the threshold cycle (CT) of detection for
each reaction. The data collection and analysis are performed with Sequence
Detection System v. 1.6.3 software (PE Applied Biosystems, Foster City, CA). A
standard concentration curve of CT versus initial cDNA quantity is generated
and analyzed with the ABI software to confirm the sensitivity range and
reproducibility of the assay. To confirm RNA integrity, a segment of the
ubiquitously expressed E2A gene is also amplified in all patient samples,
along with a standard E2A or GAPDH cloned cDNA dilution series. This method
can be utilized to quantitatively analyze expression levels for any gene of
interest.
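The standard curve described above relates threshold cycle to the logarithm of the starting template quantity. The following is a minimal sketch of that relationship, assuming an ordinary least-squares fit; it is not the ABI software's procedure. The dilution series and CT values are hypothetical numbers generated from an ideal curve, and all function names are illustrative.

```python
import math


def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of CT = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amplification_efficiency(slope):
    """Per-cycle efficiency implied by the slope (1.0 means perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0


def quantity_from_ct(ct, slope, intercept):
    """Interpolate the starting quantity of an unknown sample from its CT."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical plasmid standards spanning six logs (copies per reaction).
standards = [1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7]
# Hypothetical CT values generated from an ideal curve, not measured data.
cts = [38.0 - 3.32 * math.log10(q) for q in standards]

slope, intercept = fit_standard_curve(standards, cts)
print(slope, intercept, amplification_efficiency(slope))
print(quantity_from_ct(28.04, slope, intercept))
```

A slope near -3.32 corresponds to roughly one doubling per cycle, which is one way to judge the linearity and dynamic range that the text mentions.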

EXAMPLE IV

Supervised Methods for Prediction of Outcome in Pediatric ALL

Discretization

First, the preB training set was discretized using a supervised method as
well as an unsupervised discretization. Next, p-values were computed by
evaluating the formula (nr/nh - er)/(er*(1-er)) and then determining the
likelihood of this value in a t-distribution. Here nr = number of remissions
for gene high, nh = number of cases with gene high, and er = expected value of
remission (44%). The results were ranked according to this p-value, and the
preB training set was compared to the entire preB data set. The results are
shown in Tables 4-7. Tables 4 and 6 show two different lists based on the
training set; Tables 5 and 7 show the entire preB data set for each of the two
different approaches, respectively. Note that OPAL1/GO is included on each of
these lists as correlated with outcome, and there is substantial overlap
between and among the lists. These lists thus identify potential additional
genes that may be associated with OPAL1/GO metabolically, might help determine
the mechanism through which OPAL1/GO acts, and might identify additional
therapeutic or diagnostic genes.
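As a small worked illustration of the formula above, the sketch below evaluates (nr/nh - er)/(er*(1-er)) for a single gene. The function name and the example counts are hypothetical. The text does not give the degrees of freedom used for the t-distribution, so the sketch stops at the statistic and does not convert it to a p-value.

```python
def remission_score(nr, nh, er=0.44):
    """Evaluate (nr/nh - er) / (er * (1 - er)) as written in the text.

    nr: number of remissions among cases where the gene is called "high"
    nh: number of cases where the gene is called "high"
    er: expected remission rate (44% in the text)
    """
    if nh <= 0:
        raise ValueError("nh must be positive")
    return (nr / nh - er) / (er * (1 - er))


# Hypothetical gene: 8 remissions among 10 cases with the gene called "high".
score = remission_score(nr=8, nh=10)
print(round(score, 3))
```

A gene whose remission rate among "high" cases equals the expected 44% scores zero; larger positive values indicate enrichment for remission.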


Cumulative Distribution Functions (CDFs) 

First, the Helman-Veroff normalization scheme was applied to the preB
training set data. Then CDFs were computed, followed by the average and
maximum difference between the CDFs. The distance between the two CDF curves
reflects how different the two distributions are; hence the maximum distance
and the average distance are measures of the way the two sets differed.
Finally, the genes were ranked by average and maximum differences for the preB
training set and the entire preB data set. The results are shown in Tables
8-11.
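The CDF comparison can be sketched for a single gene as follows. This is an illustrative reading rather than the original implementation: the two empirical CDFs are assumed to be evaluated at every value observed in either group, the expression values are hypothetical, and the Helman-Veroff normalization step is not reproduced.

```python
def ecdf(sample, x):
    """Fraction of observations in `sample` that are <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def cdf_differences(group_a, group_b):
    """Return (maximum, average) absolute difference between two empirical CDFs.

    Assumption: both CDFs are evaluated at every distinct pooled observation.
    """
    points = sorted(set(group_a) | set(group_b))
    gaps = [abs(ecdf(group_a, x) - ecdf(group_b, x)) for x in points]
    return max(gaps), sum(gaps) / len(gaps)


# Hypothetical normalized expression values for one gene.
remission = [1.0, 2.0, 3.0]
failure = [4.0, 5.0, 6.0]
max_gap, mean_gap = cdf_differences(remission, failure)
print(max_gap, mean_gap)
```

Ranking genes would then amount to sorting them by either returned value, largest first.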

The relative expression level for Affymetrix probe 39418_at (i.e., 0.5 = half
the median) was plotted across our pediatric ALL cases organized by outcome:
FAIL (left panel) or REM (right panel), using Genespring (Silicon Genetics).
The results showed that this gene's relative expression appears to be higher
across failure cases and lower across remission cases.

Affymetrix probe 39418_at appears to be a probe from the consensus sequence
of the cluster AJ007398, which includes Homo sapiens mRNA for the PBK1 protein
(Huch et al., Placenta 19:557-567 (1998)). The sequence's approved gene symbol
is DKFZP564M182, and the chromosomal location is 16p13.13. Originally, PBK1
was discovered through the identification of differentially expressed genes in
human trophoblast cells by differential-display RT-PCR. Functional annotations
for the gene that this probe seems to represent are incomplete; however, the
sequence appears to have a protein domain similar to the ribosomal protein L1
(the largest protein from the large ribosomal subunit). PBK1 may prove to be a
useful therapeutic target for treatment of pediatric ALL.

EXAMPLE V.

SVM Analysis of Pre-B ALL Cohort Data to Discriminate Between Remission and
Failure and Among Various Karyotypes

We applied linear SVM, SVM with recursive feature elimination (SVM-RFE), and
nonlinear SVM methods (polynomial and Gaussian) to the preB training dataset
to get a list of genes associated with CCR/FAIL. Table 12 shows the top 40
genes for distinguishing remission from failure (CCR vs. FAIL). However, CCR
vs. FAIL was nonseparable using these methods.

We also used SVM-RFE to discriminate members of the data set who have certain
MLL translocations from those who do not. Table 13 shows the top 40 genes
found to discriminate t(12;21) from not t(12;21) (we excluded patients without
t(12;21) data from this analysis). Table 14 shows the top 40 genes found to
discriminate t(1;19) from not t(1;19). We did not see significant separation
for t(9;22), t(4;11) or hyperdiploid karyotypes.
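To make the idea of SVM-RFE concrete, the following is a minimal sketch of recursive feature elimination around a linear SVM. It is not the implementation used for Tables 12-14: the text does not state the solver, the regularization settings, or how many genes were removed per round. Here the SVM is trained by plain subgradient descent on the regularized hinge loss, one feature is dropped per round, and the data, hyperparameters and function names are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin, so the hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep):
    """Repeatedly retrain and drop the feature with the smallest squared weight."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        weakest = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[weakest]
    return active


# Hypothetical toy data: +1 = CCR, -1 = FAIL; only column 0 tracks the label.
X = [[2.0, 0.2, 0.1], [2.0, -0.2, 0.1], [2.0, 0.0, -0.2],
     [-2.0, 0.2, 0.1], [-2.0, -0.2, 0.1], [-2.0, 0.0, -0.2]]
y = [1, 1, 1, -1, -1, -1]
print(svm_rfe(X, y, n_keep=1))
```

On real expression data, the columns would be probe sets and `n_keep` would be set to the length of the desired gene list (40 in the tables referenced above).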



80 



3 

CM 



(9 

CO 
> 

U 
O 



0) 

o E 

o 



s 



a> 
o 

^§ 

O CTJ 
- S"^ 

E CO ^ 
>. o 



X v; 
CO p. -rr -5^ 



^1 2> 



a> 

Q. 

(/5 o 

c 



CO 



CNJ 



o _ 

O) *<!} 

2 2 - 

- Q. >> 



CO 



CO 



.if 

CO 'q} CO 

CO Q. (0 
C >» C 

CO x: CO 
CNJ 

SCO 

T- CO 
o ^ o 

z z z 



<D 
O 

2> 

o is 

§5 

5 CM 

(O 
CO 

CO -r; 



8 
8 



2 ^ 

Q. 

CO CO 

CO CD 

c c 

CO CO 

CNJ CO 

CO O 

CO 'T- 

o o 



o 

CO 



o 

CO 

E 

CO 
CO 

c 

CO 

s 

5. 



o 



o 



CO 

"c 

Q) 

o 

CO 5>» 
2 CO 8 

0) :g CO 
-C 

0 _Q CD 

1 8 I 

CO CO CO 



CO CO CO 

c c c 

CO CO CO 

o CO 

^ lO CM 

^ o 00 

o o o 



z z z z z 



5 

E 

CD 

E 
a> 

O) 

c 

CO 

1 



si 

E 



o) iS 

1 ^ 

s 1 

«o S 

& .52 .3> 
S £ 

CO CO CO 



o 

CO 
0) 



CD 

Q. 
X 

CD 

E 
o 



s 



-Sa E 
o o 

^ o 

lO <D CM 

S ^ 



CO 

(D 
CO 
CO 

c 

1^ 

<D 
CO 
CO 



CD 
CO 
CO 



*<D 

2 
a> 

CO 

.> 
o 

CO 
I 

c 

CD 

o> 

O 



€0 CO 

c c 

CO CO 



C3> CO 

CM CO CM 

o 

LO 

o o 



CO CO 



CO. CO. 



CO. CO. CO. CO. 



CO 


CO 


CO 


CO 








CO 








>* 


CD 


CO 


CO 


CD 


c 


c 


c 


c 


CO 


CO 


CO 


CO 


CO 


lO 


00 




Oi 




CM 


s 


CO 


o 


CO 






o 


CM 


to 




o 


o 


o 






<=>. 




s s s 


5 


Z 


z 


z 


z 


^« 










to. 


CO 


CO CO 



O 
O 

E 
o 



CO 
CO 
(D 



E 

8 



c 

8 

2 
c 
"o> 
•c 
o 

CO 
CO 

CD 

c 

CO 

CO 

in 
ir> 

CM 

o 
o. 



0) 

8 



CO 

o 



(D 
CO 
CO 



CO 

8 



T3 O 
<D 

d < 

<D 

=S CO 0) 
<3 0) o 
_ CO O 

2 2 ^ 

ip 

CO Q. >» 
2 E <? 

CO Q. ^ 



CO CO €0 

C C C 

CO CO CO 
CNI 

CD CM CM 

CM O) CO 

CM o in 
o o 

555 



0) 

o 

CO Q. 



C 0) 
<D C 

E o .g 

<D ^ (D 

^ CO 



CO 

5:; ^ 
o c 

O CD < 

^ s 



CD 00 
CD CD 
CO O o> 

gas 

555 

z z z 



CO 
•£= 
T5 



O 

E 
o 
< 
z 

CM 

<D 
CO 
CO 

c 

<D 

2 

"O 

>» 

JC 
<D 
T3 

0) 
CO 

E 

CO 



CO 

oo 

o> 
in 
o 
o. 



CM 
w 
O 

8 



O. 



a> 

o. 
I 

O 
■o 

CO .y 
5 ."2 

s 1 

CO Q. 

2 8 



(D O 

CO w 

CO ^ 

2 c: 

Q. O 
X 

CD CO 

J> 2 

2 3 

_ "O 

a> "55 

o o 

>> C 

-c a> 

CO CO 



CO CO 

c c 

CD CO 

O T- 

CM O 
CO 

CD 'T- 

O O 



_ -. CO fc-' ^ I-.' _ 

co' cm' co' tt' o' cm' cm' CO cd' oo' cj>' co' o' T-' o" 

ooinh-'^ino>T-T-r^ I t- co co o co co 
ocDTj-T-cDOcsjcooincoooininc^T-co 

cocococOTfCMcocOcococococococo^co 



CO CO CO CO. CO. CO CO CO 



\ I I 
m CM 

CO 



O-^inCMCDCOCMCO 

h-inin'^^'^-r^cD 
r>-T-ir-ooa»cMoot^«jcor*-c3 
^coooincooinoocMCDcoco 
cocococococococococococo 





CO CO CO CO CO 

CM o> o m CM oo 
o cj> cn CO 

CD CO " 
CO CO 



8§ 



CO 

oo 



o 

(A 
3 

o 

a> 

Q. 

E 

5 



o 
u 

a> 

o. 



■4—' 

2 

Q. 
>< 

CD 

E 



E 
o 

- 0) 

w— -•— « 

■5 ^ 



|1 
E8 

It 
8 8 

CO 2> 

I 1 



Q> 
O 

s 

_ *i> 
o 3 

O) ^ 

2> C 
^ *C 

c iS 

s ? 

.E LU 

CA CO 

ii. J. 

CD CO 

c c 

CD CO 
lO 

o 
oo 1- 

ss 
55 



8 i 

LL. 

rr\ CO 
LU CO 

w SS 

CD "5 



Q. 
CD 



a> 

CA 
CO 

c 

0) 

c 
'c 
o 



CD 

c 

CD 



CO 
O 
O. 



CD. 

00 iO 
CO IO 
CO CM 

^ CO 



« «, <D 

IO 0> CO 

IO CO ^ 

CO g ^ 

CO CO CO 



O 



CO 

> 



I 

CO 

.«8 



c 

0) 

9 

CL 
O 

E 

0) 

c 
o 

CL 
</) 



E 
o 
E 

E 
o 



o 

Q. 
(O 

c 
2 
a> 

CO 

E 
I 

CD 

•«= O 

3 S 

1 S 

e 

0) o 



CO 

E 
o 



(Q 

(D 

JZ 



^ ^ Q. 

0) -Y «i 

^ iS o w 

CO <D O 0> 

_ -fi .>< *;= o 



(O 
CO 



0) c 
E g 

s i 

O 

CO CO 
_>» ^ 

CD CO 

c c 

CO CO 

^3 



o o 
o. o. 



2 2 s 



(O 

s| 

c a> 

11 

S 

2 2 

N CI. 
</> CO 



CO CO 

c c 

CO CO 

CO 

CM 

00 

^ CM 

o o 

55 



CO 

-t a, 
^ TO 

Q- 

o <o 
iE g 

<D ^ 
Q. O 

Q. 

i2 Q. 



CD CD 

c c 

CO CO 

CD lO 

a> CD 
CM 

CO CM 

o o 



^ 5 « 

0 g 8 

y> o c 
5 .E ^ 

:i € S 

1 ^ 

t3 i «> 

i 1 

■s. « ? 

l|i 

<2 CO CO 

'co 'to "<o 

^ ^ 

CD CD CD 

C C C 

CO CO CO 

CM to CO 

o o in 
CD CO 

CD CO CM 

o o o 




X 

o 

SI 

s 

E 
o 



0) 
03 



_CO CO CO 

"co CO CO 

^ ^ ^ 

CO CD CD 

c c: c 

CO CO CO 

in CM CO 

CM CO 

O CM CM 

CM CM t- 

O O O 

z z z 



(O CO CO 

"co "co CO 

>*>»>• 

" CD CO 

C " 
CO 



CO 
c 

CO 



c 

CO 



o 
o 

CM 



CO 

T- 1- CM 

O O T- 



jk: .E 
:= Q) 
_C0 "o 

Is 

Q JC 

(O (O 
"co "co 

>• >* 

CO CO 

c c 

CO CO 
CO GO 

m in 
<j> o 
T- a> 

CM 

55 



O 
CO 
CO 

c 

J5 

=3 

oo 
9 
c 
o 

To 

o 

CL 
CO 
O 

>- 
i 

« 



> 8 

WO)** 

I 

$5 CM O 

^ 1- Q. 

8 5 g 

9 5 -o 

CO CO CO 

'co CO "co 

CO CO CO 

c= c c 

CO CD CO 

CO O CM 

CM 00 

CO in 

o in CO 

o o 
o. o. 



I 

0) 
CO 

% 

■o 
9 
% 

CO 
CD 



c 
*c 
o 



E 
o 
o 

CM 



i 

o 

CO 

<v> 

CO 
X) 

9 

0) 
CO 
CO 



C= CO 

E 

CO CO 

"co CO 

>• 

CD CO 

c c 

CO CO 



CO 

o 

T3 

a> 

s 

CO 



CO 
C3) 

in 

CM CD 

o o 

55 

z z 



III 

O <D 

!=: O y 
O) >o c 

.E ^ 5 
-o ^ 2^ 

1q O CM 

O) ' w 'S 
8 -E o 

CO CO CO 
"co *cO 'co 

CO CO CD 

c c c 

CO CO CO 

1^ CM in 



oo 



OO CM 
CM 

in CM 
o o 



z z z 



0 ^ ^ ^ 
I ^ CO CO CO CO 

'^I ^* p' po' "co. co' r^' cd' cm' 




CO cococococOcoco 

Ico lllllll^ I 
OCOr^ I'^^CDCOCOOOCMCO. CO 



in 

CD 

c 
o 



c 

O) 
CO 
CM 

Q 
O 



CD 
C= 
CD 
CM 



O 

o. 



P 9-^ 



CD Q 



5 

> o) a> 
11? 

ill 



CO 



o 



— a> 

2 £ 

tS < 

<D Z 

c5 a: 

CO CO 



CO CO CO 

'<» "w 'co 

_>»_>» ^ 

CD CO CD 

c c c 

CD CO CD 

0> CO CM 

CO in 

CD CO 

5 3 5 

o o o 

z z z 



CD 

(0 ' m m 



-1- ^, 

_ _ _ CD CD CO 

CO* oo' Co' o' *~| Co' ' <o' 

lO-r-COOCO"^^^ 
COI^CMCOOOCO'T- 
CO CO CO CO T- ^ CO ^ 



o 

o 



CM 
CM 



> 



Cvf 



I 

CO 
CO 



oo 



c 
U) 
o 
c 
a> 
•o 

OS 

o 

O OJ 
0) <J 

Q- r\ 

<s 

Z Q 
(Q CO 

c — 

CO 

CO CO 

CO a> 

oo </> 
to CD 

^ CO 

i-S 

CO a> 
o 



< 
O 



(D 

CO CO 

Q> O 



a> 

CO 

2 r 



E 
'> 
S 

c 

0) 

o 



^ << CD 



m 

<D 
CO 
CO 



I- 

o 



CO 

> 



8- 
I 

C9 



CO 

o 
o. 



< 

<o _ 

*CO CO 
CO CO 

c c 

CO CO 

il 

ID CM 
O O 

z z 



o:: .E 
> E 

C CO 

5 -S 

2 a> 
E 

a. o 

(O u 
•c ^ 

< 



^ E 

•4. CJ) 



CO CO 

'co 'co 

>» 

CO CD 



CO 
oo CD 

"I 

CO CD 

Q O 
CO 
_ <o 

M CD 



CO CO ^ CO 
CO CNJ ™ CM 
O 0) tf> 

g ^ «20 

CM O 

O ' 

z z -S z 



CO 



0) 

2 



o 

CO 



CO 
CD 

c 

CD 

a> 

CO 

8. 



0) 
CO 
CO 

c 

<D 



CO 

2 



u 

Q> 

o 
E 



Q O 
(O 



O 

8 

>- 

CO 

CO 

c 

CO 

oo 

lO 

a> 

o. 



CO 



_ ^ 



CO 
CO 



(1> 
2 

Q. 

o> 
c 

E 
lo 

1 g 

a> lO 
.E u. 

CO CO 
*CO *C0 

>• 

CD CD 

c c 

CO CO 
lO CM 
CM O 
CO 
CO 
T- O 

Z 



o 

(O 

^ c 

B 2 
P °- 



Zi o) 

:o .E 
si 

:w CO 

!^ 

.S2 >* 
« To 
^ c 

CD CD 

CO 

T- CO 

CO o 



< 
•it 

O o 

li 
if 



E 

o _ 

ol 

CO 1 

CO CO 



CD CD 



CO 



CO 

o o 

CM CO 

CD ^ 

lO CM 

O O 

O O 

2 S 



z z E z z 



S 

E 
o> 
E 



E 
o 

■D CO 

Ef 

it 

CD 

E 



CO 



^ 0) 

8 E 
2 2 



CO 

5 >^ 



a> 

it 
11 

«- E 

Q> O 

O CO 
m ^ 

E 

^ ci 

CO 

O) CO 

S 



I- — CD 



E 

JO 



0) 



8 



^ 2 «- 

J- ^ n 



1 - 
ii 

f 8 
8 



« 8 



m (0 



° o 

1 2 



a> 
E 

0) 

E 
a. 

O) ^ 

CO CO 

'co 'co 
^ >• 

CD CD 



o ^ 

Q. 0) 
O 

2 5 

O) 

O 

- 2 

1- 

^ o 

^ c» 

o J5 

s I 

^ 

C CD c 

O Q 

CO CO CO 

'co 'co 'co 
^ ^ ^ 

CD (D CD 



O 

0 o 

1 § 

B 2 
"co ci. 

I":: 

(0 <i> 

>v CD 

8 E 
E E 



CO 

CO c3 

CO CO 

'co 'co 
^ >» 

CD CD 



E 

0> 



o 

c 

E 
E 

CO 



<D 
CO 
CO 

c 

0) 



o 



u S .E <D 

' (D -tS 
(D 

^ O 3 



CM 
O 

O) o 

•If 

E I. 

2 

C3> o 

<D to ' 



OO 



CD 



a> CO 



CO o 

s - 



CO 
C7) 
CO 
CO 



CO 



CO CO CO 



CO CO cO cO 

C7> CO C3) CO lO lO 

lO CO CO O 05 CO 'sr 
_ _CMOCMt-C0OCM 

cDi-iOT-T-Tj-oroir> 

OOOCMOOOOO 

zzzzzzzzz 



"s 2 

CO 

c o> 
o to 

x: a> 

O CO 
CO CO 

CO O 

CO CO 

*co 'co 

2^ ^ 

CO CO 

C C c c c 

CO CO CO CO CO 

00 CO ^ CO lO 

0> 0> CO o ^ 

CO CO CO CM o> 00 

O T- CM XT O CO 
O CM O O O O O 

5555555 

z z z z z z z 



0> 

c 

CO CO CO 

'co *co 'co 
^ ^ 

CD CD (D 



o 

CO 
CO 



O 0) 

I'i 

E 

o Ji 

S E 
2 

a. CD 

CO CO 

'co 'co 

CD CO 

c c 

CO CO 



cO ••-^ ■«>^ 

I CO CO CO 

^1 o' CO oo' co' 
00 oo I 

00 CO CM oo 1^ 

<o a> oo 

^ CO CO CO CO 



CO CO CO CO CO CD CO 

o' '^' CD. co' 1^' CM r-. m 
f- T- I CO CO CO m 

0<00>COO>t-^i- 

Qoa>coc30h-h*-coo 

COCOlOlOCOCOCO'^ 



C0| C0| (Oj C0| COj COj COj C0| CO^ CO^ co^ co^ 

co«pcor>inq>r^cDO>cocOjjrtJD co. "co. ro. co 



«o. 



CO CM CO CM O O T- 

CO 0> 0> CO CO CO 

oo h- lO lO CO CO 

CO CO CO CO CO CO ^ 



^' 05. to 



ooooo>^moo I I IcoS I I 
^-^Trm^cMr^h-r^mcoT-'^ 
r--.co<5ooOf-h-h*ooioio 
CO co^cocor*-«iococoo> oo 



CO 



2 a 



CL 
O 

o ^ 

o c 
«» o 



8 

CO 

E 
% 

CO 

2 

3 
<D 
C 



T3 



a> 



o. o 



O 

0) £ 
-A > 



E 
E 



CD 

E 
o 



T3 
C 
CD 



CD 

1 

3 
CO 

U- 

Q 
CL 



^ 3 

8" '5 



(O CO ^ to 
CO 2 CO 



CD 

CO CO 
O h- 

ir> tn 
o o 



in CM 
CO o> 
o> 

CM T- 
CM 



E 

o 
o 

0) 



(A 
CO 



ay 

CO 

s 

8. 



o 
ST 



CO 

> 



z z z z 



^1 ^ 

^1 CO CO CO CO CO i/> 

t- O) CO f <0 CO CO 

^ CM 0> O CM CM 

00 CO CO ^ CO CO o> 

T- a> CO CO CO lO 

CO CO CO CO CO CO CO 



CO 



ri 



O 



> 

a 
o 

S 

.2 

§ 

o 



< 



S 

o 

CO 



<L> 
CO 

<L> 

g 

GO 
<-» 
CO 



d 

1 



J3 

CO 

2 



2 2 



CO 
CD 
CO 

73 



CO 



(0 

CO 
< 



0) 



o 
o 

u. 
o 

O 

! 

-2 

C9 



o 

JO 

Q. 
0> 
Q. 

o 

Q. 

E 
<u 



8 



o 

CO 
■ 

c 
0) 

2 



0) 5 

to ^ 

(0 CO 

IS E 



0) 

o 

Q. 

in 
o> 
o 

I 

CA 
*CO 
>» 
CO 

c 

CO 

CO 

o> 

CO 



8 



^ 8 

^ a> CO 

<D c a> 

c o) c 

8 Q S 

2 8 



S 



o 

.E CM 

B O _ 
o a: o) 

to 0> CO 



CO CO CO 

lO 0> CO 
XT CO CO 
CO O CD 

sss 



CO CO 

in o 

SCO 

CM 

o o 

z z 



CO CO 

<Dj COj H-j CO CO C0| 

CO CM LO o CM cn 

o> CO CO I oo 



a> 
c 

s 
s 

E 

O p 

ifi o 

2 8 

Q. E 

« S 
.tf ^ 
c o 

E ^ 

0(0-0 

ltl 

g 5 J; 
^ S a 

5 CO X 

CO <M O 

£ ^^--^ 

CO "C CO 

-it 

^ O CO 

CO *^ <r 

O CO ^ 

CO < 

5S « A 

CO 3 

.2 2 o 

2 5 < 

°- ?5 
E Q. o 

CO CO CO 



oo 



o 
o> 

B 
c 

(0 

a. 

CD 



CO 



o E 
y <i) i2 

To 
E 
o 



a> 
E 

c 
a> 
o 



CO c 



0) 
CO 

CO 

c 
2 

1 



8 

0> CD 

.E CO _ 
c P 



i ^ i 1 



iS 8 

Q. U 
CO to 



Q. O 
CO CO 



S CO 

a. -= ^ (i> 

o 2 .1 

C ^ S CO 

.E = 3 o 

Q. CD O) Sr 

a> ^ C ^ 

CO CO CO CO 



CO CO to CO CO 



CO CO CO CO CO CO 



o .E 
E a> 

9- GL 

CO E 
-£ o 

Q. CO 
CO o 
C J3 
CO c 

CO CO 
CO *CO 

>• >* 

CO CO 



o> 

CO 'co 
c o 

.2> "to 

CO o 

.E ^ 
o 



Q. O 

o ^ 

O (O 

II 

O) 

a> S 

CO to 
"co 'co 

>» ^ 

CO CO 



CO 



CO 



CO CO CO 

CO ^ o ^ u> 

Tt ^ <o <0 ay 

CM CD CO CO 

O O ^ CM CM 

O O CD o o 



CO CO 
O CD 



CO CO 
O CM 

o oo 

SCO cj> 

CO T- CM 

O T- o o 



CO CO 

CO 
CD m 

CM CO 

o o 
o o 



2 s s s s 



CO CO CO CO 

lO O lO CO 

CX> f- CM C3> 

OO O 0> CO 

? 5 S " 

s s s s 

Z Z Z Z 



o 
o 



c 

1 

Q. 

V— 

CO 

i ° 

1^ o 
o v£5 

CO CO 

8 1 

■s ^ 

o 

o. 

to CO 

o Q. 

Q. x: 

to CO 

*co *co 
->* ^ 

CO CO 

c c 

CO CO 

ay CD 

CJ> CM 
CO 'i- 
CM CM 

^ o 

S'5 



a> 

CO 
CO 

c 
<1> 



o 
<1) 



CO 

o 
o 

CO 
CO 

o 
o. 



^ CO CO CO 



^ ^ ^1 ^ ^ ^ TO ^ ^ ^ 

^ ««, «, «, TO ^, ^ CO TO TO TO TO TO ^ ^ TO '^i ^. ^, ^ 

TO-^rr^ to to CO Ti- ■ iroTj-cMooTOLO—'inTr lor-ooTOcoTOi^-f-cM 

ICDCM l(OCM<<TO) It^mCO tO C}>CMI^OCMOCM ICM iCM'^tO 

- - — - ■ — — -i ^ ^ •5-ooOf-"^cDcocp"«-or^'^T-cocDcocD»noo 

_ _ _ _ _ CI)^CDCDl0CMC05'~C>OC0C0CM0)C000^Mr 

04coo>rocofsi<o<oco^^<ococoo)cocococoT-^^cocoT— cococococo 



?oiocMM^c3>r^'^r^cpooopr^coo 
CO o> CO o> 



CO 



0) 



CM 

In 
o 

Li- 



as 
« 
E 
o 
<o 
o 

"O 

c 

0) 

>* 

OS 
0) 
CO 

CD 

c 

CD 
CO 
CO 
lO 
CO 

o 



I 

II 



CO 

*(0 „ 

(0 CD 

c c 

CO CO 

CO T- 

0> CM 

co <x> 

O 'c- 

z z 



I CO t— CO 

CM CM CO 0> CO 

CO O C3> 1- 

O) O GO CO CO 

CO ^ CO CO CO 



CO 

o 
o 

I 

iO 



CO 



o 

=3 



E 

Q. 

-9 
<u 
> 

"O 
T3 

TO 
0) 

C= 
g 

a> 



CO 



o 
m 
m 

o 
o. 



o 

=3 
O) 

a> 

(A 

o 

o — 

Q. 0) 
(O Ol 

8 



o 

Q 
< 



CO 3 
Q. Q) 

<l 

O .E 

(D (0 



CD 

00 00 
CO o 

o o 

z z 



o 
^-^ 

8 

CD 



o 
o 

0) 

o 



CO 

o 
o 
o. 



a> 

Q. 
0) 

_>. 
o 

Q. 

<D 
W 
CO 



0) 



S 

00 

ffl 

X 
E 

CA 
O 
lO 

2 

a> 
E 
o 



CO 

"to 
(0 

c 

(O 

1^ 
o> 

s 

o 
o. 



CO 

E 
E 

CD 
O) 
CM 
C 

o 

Q. 



8 "2 



a> 
a> 
c 

> 

CO 
CO 

E 
E 

CO 
05 

CNJ 
GL 
< 

o 



CO 
CO 



CN 
CM 
CM 
CO 
O 

Z 



P 



c 
<D 

E 
_a) 

Ql 

E 
o 
o 
>> 
o 
c 

<D 

*o 
1^ 
a> 



CO 
Q. 



T3 
O 



0) 

E 
a> 
cx 
E 



CO 
CO 

p 



CO 
Q. 
0) 



O 

CO < 

S LU 

CO CO 

"co *co 
_>* 

CO CO 



CO 



CD O 
CO 

in 

o o 

55 

z z 



CO 
CO 



CJ 

2> 

CM 

E 
o 



sl 

O 

"O c 

a> 
o 

CO 
CO 
CO 

6 



CO a> 

I— -•— • 

CO 

<1> o 

c «*> 

C CO 

= CO 



0). 

o 

I 

O) o p 

CL 0, 



CO 

I- ^ 

CO 



8^1 

-7 CO " 

^« 

g»E 

— o 
CO O 
'co -r 

^ d> 

~ Q. 
C7> 



2 2 
g ^ 

> 0) 



CO 

c 

CO 

1^ 



5 CO 



O 

5 

z 



11 



CO 



CO CO 

c c 

CO CO 

CM r-- 

lO o 
to CO 
O CM 

O T- 

55 

Z 2 



■5 c 

1^ 

g CO 

.E o> < 

a> CO 5 

o to 
to to to 

*C0 'co 'co 

>»>*>» 

CO CO CO 

c c c 

CO CO CO 

CD O tT 
CD 

CM CD 

CO TT lO 

555 

z z z 



CO 

s 



i 

E 

CO 

.i < 

o>S E 

CD << CD 

.£ S E 

2> 'a) 'E 
? o § 

J5 ^ CO 

*^ o ^ 
«2 o <i> 

£ CO C7> 

2 S o 

!2 Q. a> 
! 2« 

O CQ CO 
CO CO CO 

J. J. J. 

CO CO CO 

c c c 

CO CO CO 

eg T*- T- 

CO <3) 
00 CD 

00 o 

O T- O 

555 



CO 

a> 

CO ^ 
-C = 
^ CO 

« S3 
•p ^ 

o> g 

1 2> 

<0 CO 

*co 'co 

>» 

CO CO 

c c 

CO CO 
O 

CM 5 

O tf> 

o o 

55 

z z 



o 

CO 

< 
o 

CO 

cc 

JO 

3 

CO 
CO 

> 

CO 

*co 

>» 

CO 

c 

CO 
CO 
CD 

o 
o 

5 

z 



CO CO CO CO CO CO 



"t^^'^lco' o'eo'^'co'o' 

O'«-Tf000)C0CDO'^T-C3C0 
0>Oh<-C0<3>C0^^ 
CMCMCMCOt-OCOCO 
co co co co cm co co 



^ CO CO 

_! CO T- 



^T-OCOCO^COf 
cO ^ ^ CO ^ ^ CO ^ 



CO CO CO CO 
CO lO CM y- 



co CO CO 







^' 














CO 




COj 


CO 


CO 


^ CO 
CO 

to 






«>' 


C^,' 




«>' 


CD 


CO 




a> 


lO 


CO 


CO 


CD 


CD CD 


CM 


s 


in 

CD 


lO 
CD 


s 




CM 




oo 


CO 


CO 


CO 




CO 




^ CO 


CO 



•i e 



OO 




lO 



E 

x> 
c 
!5 

q: 

c 



o 

CO 

CO 
CO 

a> CO 
Q. o 

>» £ 

Q. 

CO 

|i 

(O CO 



(D CO 

c c 

CO CO 
CO -r- 

a> to 

58 

=•5 



00 

CO ca 
1^ 

O CQ 

3f 

•55 > 

e c 

3 £ 
8^ 

Is 
^ s 

^2 

CO CO 

'co *co 
(0 CO 

c c 

CO CO 
0> CM 
0> CD 

o 

CO 

1- o 

it' 

2 2 



c 
o 



o 
'cS 

o 
o 



"'i ^, ^ ^ 

CO. ^ 1-1 CO. CO. 

CM ^ OJ CO CO 

lO OO CM CO CM 0> 

r; CO 5; CO in io 

CO ^ CO CO CO CO CO 



(0 



I 



OB ^ 
C > 

fe CO 



Si 

II 

a> o 

— CO 

0) o 

1€ 

V> CO 



CO CO 

c c 

CO CO 

O 00 

o o 

CD O 

o ^ 

o o 

z z 



CO 

E 
o 



CO 

to 



I 

T5 

c 

CO 



o 

LU 

= 11 

(D C 3 
D) CO O) 

8| £ 

c E c 
o <i> g 

CM E 0) 

O S2 t 
(D S .S 

CO CO CO 



CD CO (O 

c c c 

CO CO CO 

a> CM 

oo ay XT 

O CD ^ 

CM CO CD 

o o o 

z z z 



o 

ts 

c 

c 
lo 

To 

(D 
Q. 

£ 

CM B 

11 

o 

.«e CO 



CO o> 

^ 00 
Q. 

TO r>- 
o 

.52 I 

C CM 
CO p 
(3> E 

f2 o 

o .S2 

51 

z "5 

CO »ff 

If? o> 



P o 

^ s 

OJ CI. 

^ to 



00 CO 

CO o 

g.ss 



CO 

c 

CO 
C3> 

oo 

CD 

o 

1°. 

Q. 

c 
a> 



CO 
CM 

^ o 

Q oo 

oo 

^ CD 

P o 



CO 

a> 

a> c 

§ 2 

^ 73 5 5 

CO CO CO CO 



Z3 C 

O 'c. 
<D C 

Q. ^ 

< ■§ 

<D O 

E 

N CO 

£ ^ 
CO o 

o> E 

CO CO 



CO (0 CD CO 

c c c c 

CO CO CO CO 

CO 0> oo CO 

O CM ^ CM 

O 1^ CM CM 

1^ CD CO O 

o o o o 



CO CO 

c c 

CO CO 
CM 

^ CO 

CD CM 

o o 



91_ 

-^^ 

E-2 
o c 

S s 

E ^ 

Eo 

CM §> 

a> -Q 

.£ o 
E 

^ E 



(I> CM 
CO (I) 

i| 

CO P 
CO 3 

c o 
CO -o 



0) 

w 
>• 
c 
<u 

T5 

-> *o 

z= CM 
<D 55 

CO CO 
CO >» 



CD 

E 



CO 



< 
z 
a: 



CO 
CD 



o 

CM 



e 

o 

-5 



o 

I 

E 

CM 

CD 
CO 
CO 

c 

0) 

o> 
2 

■o 

0) 

■«— « 
CO 



s s « 

-c »S> ci. 

E 2 o 

O Q. ^ 

J3 CD 

CO CO CO 



CD CO CO 

c c c 

CO CO CO 

O <3> 00 

O CM O 

CM CJ> ^ 

r- m o 



u 
2> 

Q. 
CM 

c 

"O) 

"o 

Q. 

o 

0> 
£= 
<D 
O) 

o 

00 sz 
^ o 

-C <D 

CO CO 
(A *CO 

CD (D 



CO ^ 

o c 

CO 
CO 

CO (J 

el 
■it 

o. E 

C3) ^ 
T3 CO 

X 

o c 

^ CO 



z 

CD 
LU 



0) 

TO 



CO 



to CO CO CO CO CO CO CO c 

CO 'co 'co *(0 *CO *CO *(0 *co E 



CO 



CO o 

ID CM 
O 

o o 



CO CO 

'co *(0 

>« 

15 To 

c c 

CO CO 

;^ CD 

CD O 

ID ^ 

o o 



ID 5 
CM LU 
CO 



< 

10 
a> 
E 

<D 

E 



E 
<1> 

1 § 

es 
if 

CO m 
CD T3 

i ^ 

$ CO 



CO 

o 
o. 



CO 
CD 



o 

CO 



o. 

CO 

'<o 

CO 

c 

CD 

00 

CO 



«^ S o 

CO ID O 
^ JO « ^' 



5 € 

CO 5 

> o 

CO — 

o J 

I- <D 

tS o 

C Q> 

0 u. 

1 3 

CO CO 



to ^ 

p 

CO 

CO E 

t.i 

O <0 

8 I 

CO CO 



CO o 



CO 0) 
ID CO 
CM CO 



CO CD 

c c 

CO CO 

CM O) 

CO o> 

CO CM 

o o 



CD CD 

c c 

CO CO 

ID CD 

O) CO 

CM CO 

o o 



CO CO 

c c 

CO CO 
CO T- 

a> CM 

5 o 

o o 



ZZZZZZZZ CO z z z 



CO. 



COCOCDCOCOnSCOCOCOCOcO H— ' CD ^ CO I 

o>' ^' «>' «' ^' ^' oo' cm' cm' «' h-' «' ""1 r.' ""i 

O>^'«-0OlOC0COO>'»-CMCDlDT-r*»lD-^ 
CMCMCM'T-CMh-COT-'^COCMCMCD 



_ ^ _ . R S 

oo^-h^r^r^co^ooc^oo^cM^T-op 

CO CO CO CO CO CO CO T— 



CO ^ CO CO CO CO 



CM 






EXAMPLE VI

Application of ANOVA to Vxinsight Clusters to Identify Genes Associated 

with Outcome 



To identify genes strongly predictive of outcome in pediatric ALL, we
divided the retrospective POG ALL case control cohort (n=254) described
above into training (2/3 of cases) and test (1/3 of cases) sets and performed
statistical analyses using Vxinsight and ANOVA. Through this approach, we
identified a limited set of novel genes that were predictive of outcome in
pediatric ALL. Table 20 provides the list of the top 20 genes associated with
remission vs. failure in the pre-B ALL cohort; several of these genes appear to
reach statistical significance. These top 20 genes are ranked by ANOVA F
statistics; we have also converted these F statistics to corresponding p values.
Not surprisingly, the overall statistical significance of outcome prediction in
Vxinsight, or with any other method, is lower than that of prediction of genetic
types or morphologic labels; we assume that this is due to the significant
biologic heterogeneity of the outcome variable in our patient cohorts. A positive
value in the "Contrast" column of Table 20 indicates that the gene identified is
expressed at relatively higher levels in patients in long term remission; a
negative value indicates that the gene is expressed at lower levels in patients
in remission and at higher levels in patients who fail therapy.
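
The two per-gene quantities reported in Table 20 can be illustrated with a minimal sketch. It assumes that the F statistic is the ordinary one-way ANOVA ratio and that the "Contrast" is the mean expression in the remission group minus the mean expression in the failure group; the function names are hypothetical and the exact Vxinsight computation may differ. The conversion of F to the tabulated p values is not reproduced here.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (lists of floats)."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = sum(sum(g) for g in groups) / n
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def contrast(remission, failure):
    """Assumed sign convention: positive means higher expression in remission."""
    return mean(remission) - mean(failure)
```

Under this convention a gene with a large F and a positive contrast would be read as strongly associated with remission, which matches how the "Contrast" column is described above.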
Table 20: Genes Statistically Distinguishing Remission vs. Fail: Vxinsight

Order  ANOVA_F  nsiORF      Contrast  P         Description
1      26.58    39418_at    -2279.06  p<=0.024  DKFZP564M182 protein
2      18.95    37981_at     2461.77  p<=0.046  drebrin 1
3      18.87    38971_r_at  -1874.42  p<=0.057  Nef-associated factor 1
4      18.82    38119_at    -2515.9   p<=0.074  glycophorin C isoform 2
5      17.18    671_at      -1340.48  p<=0.068  secreted protein acidic cysteine-rich
                                                osteonectin
6      16.74    577_at       3653.53  p<=0.125  midkine neurite growth-promoting
                                                factor 2
7      16.05    37343_at     3009.04  p<=0.122  inositol 1,4,5-triphosphate receptor
                                                type 3
8      14.37    1126_s_at   -2870.22  p<=0.177  Human cell surface glycoprotein CD44
                                                gene, 3' end of long tailed isoform
9      14.33    32970_f_at   1440.29  p<=0.127  hyaluronan binding protein
10     13.83    41185_f_at   1446.05  p<=0.190  SMT3 suppressor of mif two 3 yeast
                                                homolog 2
11     13.78    33362_at    -1537.08  p<=0.175  Cdc42 effector protein 3
12     13.74    38652_at     1811.99  p<=0.029  NM_017787 hypothetical protein
                                                FLJ20154
13     13.31    824_at      -2173.7   p<=0.160  glutathione-S-transferase like
14     13.28    35796_at    -1815.29  p<=0.243  protein tyrosine kinase 9-like
                                                A6-related protein
15     13.06    40523_at     1523.7   p<=0.178  hepatocyte nuclear factor 3 beta
16     13.06    37184_at    -2181.49  p<=0.151  syntaxin 1A brain
17     13.04    34890_at    -1087.46  p<=0.195  ATPase H+ transporting lysosomal
                                                vacuolar proton pump alpha
                                                polypeptide 70kD isoform 1
18     12.94    41257_at    -1030.55  p<=0.155  calpastatin
19     12.86    41819_at     1020.59  p<=0.264  FYN-binding protein FYB-120/130
20     12.71    32058_at     1413.3   p<=0.214  HNK-1 sulfotransferase


Interestingly, OPAL1/GO (38652_at; NM_017787, hypothetical protein FLJ20154;
see Example II), at position 12 on the table, appeared on gene lists produced by
four different supervised learning algorithms (Bayesian networks, SVM, Neurofuzzy
logic) and was ranked extremely high (top 5 or 10 genes) or at the top
(Bayesian) with each of these very distinct modeling approaches. The degree of
overlap between outcome genes detected with these different modeling
algorithms was quite striking.

The gene at the number 5 position on the table (Affy number 671_at,
known as SPARC, secreted protein, acidic, cysteine-rich (osteonectin)) is
interesting as a possible therapeutic target. Osteonectin is involved in
development, remodeling, cell turnover and tissue repair. Its principal
functions in vitro seem to be counteradhesion and antiproliferation
(Yan et al., J. Histochem. Cytochem. 47(12):1495-1505, 1999), characteristics
that may be consistent with certain mechanisms of metastasis.
Further, it appears to have a role in cell cycle regulation, which, again, may be
important in cancer mechanisms. Furthermore, it should be noted that other
significant (about p<0.10) genes on the list might also have mechanisms that,
taken together, could suggest mechanisms consistent with the
observed differences in CCR and FAILURE. The group of genes, or subsets of
it, may have more explanatory power than any individual member alone.

EXAMPLE VII 

Genes That Distinguish Karyotype Identified by Bayesian Methods 

In the context of disease karyotype subtype prediction, we applied
Bayesian nets to the preB training set data in a supervised learning environment.
A set of training data, labeled with disease karyotype subtype, is used to
generate and evaluate hypotheses against the test data. The Bayesian net
approach filters the space of all genes down to K (typically, K between 20 and
50) genes selected by one of several evaluation criteria based on the genes'
potential information content. For each classification task attempted, a cross
validation methodology is employed to determine for what value of K, and for
which of the candidate evaluation criteria, the best Bayesian net classification
accuracy is observed in cross validation. Surviving hypotheses are blended in
the Bayesian framework, yielding conditional outcome distributions.
Hypotheses so learned are validated against an out-of-sample test set in order to
assess generalization accuracy.
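
The filtering step can be sketched as follows. The text does not name the evaluation criteria used, so this sketch assumes, purely for illustration, that each gene has been discretized to 0/1 (for example, low/high expression) and is scored by its mutual information with the class label; the function names and the discretization are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    p_feature = Counter(feature)
    p_label = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        # p(f,l) * log2( p(f,l) / (p(f) p(l)) ), written with raw counts.
        mi += (count / n) * math.log2((count * n) / (p_feature[f] * p_label[l]))
    return mi

def top_k_genes(discretized, labels, k):
    """Keep the k genes with the highest score (ties broken by gene name)."""
    scores = {g: mutual_information(v, labels) for g, v in discretized.items()}
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]
```

In a cross validation setting, K and the scoring criterion would be chosen by repeating this selection on the training folds only and comparing held-out accuracy.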

Approximately 30 genes from prediction of each karyotype were
combined. The gene list in Table 21 can discriminate translocations of t(12;21),
t(1;19), t(4;11), t(9;22) as well as hyperdiploid and hypodiploid karyotype from
normal karyotype.

Table 21. Genes for karyotype distinction derived from Bayesian Analysis of
pediatric ALL microarray samples

Affymetrix ID  Gene description

35362_at       hg01449 cDNA clone for KIAA0799 has a 1204-bp insertion at position
               373 of the sequence of KIAA0799.
1325_at        Sma and Mad homolog
1077_at        recombination activating protein
34194_at       Source: Homo sapiens mRNA; cDNA DKFZp564B076 (from clone
               DKFZp564B076).
32730_at       Source: Homo sapiens mRNA; cDNA DKFZp564H142 (from clone
               DKFZp564H142).
34745_at       Source: Homo sapiens clone 24473 mRNA sequence.
37986_at       Source: Human erythropoietin receptor mRNA, complete cds.
40570_at       Source: Homo sapiens forkhead protein (FKHR) mRNA, complete cds.
40272_at       Source: Homo sapiens mRNA for dihydropyrimidinase related
               protein-1, complete cds.
2036_s_at      Source: Human cell adhesion molecule (CD44) mRNA, complete cds.
35940_at       Source: H.sapiens mRNA for RDC-1 POU domain containing protein.
41097_at       telomeric protein
39931_at       dual specificity protein kinase
31472_s_at     hyaluronan-binding protein; soluble isoform CD44RC; alternatively
               spliced
32227_at       hematopoetic proteoglycan core protein (AA 1-158)
37280_at       Mad homolog
36524_at       hj05505 cDNA clone for KIAA1112 has 983-bp and 352-bp insertions
               at the positions 820 and 1408 of the sequence of KIAA1112.
39824_at       Source: tg16b02.x1 NCI_CGAP_CLL1 Homo sapiens cDNA clone
               IMAGE:2108907 3', mRNA sequence.
35260_at       Source: Homo sapiens mRNA for KIAA0867 protein, complete cds.
35614_at       Source: Homo sapiens TCFL5 mRNA for transcription factor-like 5,
               complete cds.
37497_at       orphan homeobox gene
41814_at       alpha-L-fucosidase precursor (EC 3.2.1.5)
1980_s_at      Source: H.sapiens RNA for nm23-H2 gene.
36008_at       potentially prenylated protein tyrosine phosphatase
36638_at       Source: H.sapiens mRNA for connective tissue growth factor.
40367_at       bone morphogenetic protein 2A
32163_f_at     Source: zq95f07.s1 Stratagene NT2 neuronal precursor 937230 Homo
               sapiens cDNA clone IMAGE:649765 3' similar to contains LTR7.b3
               LTR7 repetitive element;, mRNA sequence.
755_at         Source: Human mRNA for type 1 inositol 1,4,5-trisphosphate
               receptor, complete cds.
32724_at       Refsum disease gene
39327_at       similar to D.melanogaster peroxidasin (U11052)
39717_g_at     Source: tn15f08.x1 NCI_CGAP_Brn25 Homo sapiens cDNA clone
               IMAGE:2167719 3', mRNA sequence.
33412_at       Source: vicpro2.D07.r conorm Homo sapiens cDNA 5', mRNA
               sequence.
40763_at       TALE homeobox protein
31575_f_at     beta-galactoside-binding lectin
1039_s_at      basic helix-loop-helix transcription factor
36873_at       Source: Human gene for very low density lipoprotein receptor,
               exon 19.
1914_at        Source: Human cyclin A1 mRNA, complete cds.
32529_at       Source: H.sapiens p63 mRNA for transmembrane protein.
32977_at       Source: Human placenta (Diff48) mRNA, complete cds.
37724_at       c-myc oncogene
39338_at       Source: qf71b11.x1 Soares_testis_NHT Homo sapiens cDNA clone
               IMAGE:1755453 3' similar to gb:M38591 CALPACTIN I LIGHT CHAIN
               (HUMAN);, mRNA sequence.
1973_s_at      c-myc oncogene
31444_s_at     Source: Human lipocortin (LIP) 2 pseudogene mRNA, complete
               cds-like region.
36897_at       Source: Homo sapiens mRNA for KIAA0027 protein, partial cds.
34210_at       Source: zb11b10.s1 Soares_fetal_lung_NbHL19W Homo sapiens
               cDNA clone IMAGE:301723 3' similar to gb:X62466 H.sapiens mRNA
               for CAMPATH-1 (HUMAN);, mRNA sequence.
266_s_at       Source: Homo sapiens CD24 signal transducer mRNA, complete cds
               and 3' region.
769_s_at       Source: Homo sapiens mRNA for lipocortin II, complete cds.
36536_at       Source: Homo sapiens clone 24732 unknown mRNA, partial cds.
38413_at       Source: Human mRNA for DAD-1, complete cds.
41170_at       Source: Homo sapiens mRNA for KIAA0663 protein, complete cds.
37680_at       kinase scaffold protein
38518_at       Source: Homo sapiens mRNA for SCML2 protein.
36514_at       Source: Human cell growth regulator CGR19 mRNA, complete cds.
40396_at       ionotropic ATP receptor
40417_at       KIAA0098 is a human counterpart of mouse chaperonin containing
               TCP-1 gene. Start codon is not identified. ha01413 cDNA clone for
               KIAA0098 has a 2-bp insertion between 736-737 of the sequence of
               KIAA0098.
486_at         prodomain of this protease is similar to the CED-3 prodomain;
               proMch6 is a new member of the aspartate-specific cysteine protease
               family
32232_at       Source: Homo sapiens NADH-ubiquinone oxidoreductase subunit
               CI-SGDH mRNA, complete cds.
33355_at       Source: Homo sapiens mRNA; cDNA DKFZp586J2118 (from clone
               DKFZp586J2118).
36203_at       Source: Human gene for ornithine decarboxylase ODC (EC 4.1.1.17).
37306_at       ha1025 is new
1081_at        ornithine decarboxylase
40454_at       Source: H.sapiens mRNA for hFat protein.
1616_at        Source: Human mRNA for FGF-9, complete cds.
36452_at       Source: Homo sapiens mRNA for KIAA1029 protein, complete cds.
35727_at       Source: qj64d06.x1 NCI_CGAP_Kid3 Homo sapiens cDNA clone
               IMAGE:1864235 3' similar to WP:F19B6.1 CE05666 URIDINE KINASE
               ;, mRNA sequence.
753_at         Source: Homo sapiens mRNA for osteonidogen, complete cds.
32063_at       Source: H.sapiens PBX1a and PBX1b mRNA, complete cds.
1797_at        CDK inhibitor p19
362_at         Source: H.sapiens mRNA for protein kinase C zeta.
39829_at       Source: Homo sapiens mRNA for ADP ribosylation factor-like protein,
               complete cds.
717_at         Source: Homo sapiens mRNA for GS3955, complete cds.
854_at         protein tyrosine kinase
38285_at       Source: Homo sapiens mu-crystallin gene, exon 8 and complete cds.
41138_at       Source: Human MIC2 mRNA, complete cds.
40113_at       Source: Homo sapiens mRNA for GS3955, complete cds.
36069_at       Source: Homo sapiens mRNA for KIAA0456 protein, partial cds.
37579_at       inducible protein
37225_at       similar to ankyrin of Chromatium vinosum.
39614_at       hh01783 cDNA clone for KIAA0802 has a 152-bp insertion at position
               2490 of the sequence of KIAA0802.
38748_at       alternatively spliced
33513_at       Source: Human signaling lymphocytic activation molecule (SLAM)
               mRNA, complete cds.
39729_at       Source: Human natural killer cell enhancing factor (NKEFB) mRNA,
               complete cds.
37493_at       Source: yj49e08.r1 Soares placenta Nb2HP Homo sapiens cDNA
               clone IMAGE:152102 5', mRNA sequence.
1788_s_at      MAP kinase phosphatase
39929_at       Source: Homo sapiens mRNA for KIAA0922 protein, partial cds.
37701_at       also called RGS2
34335_at       Source: wi81c01.x1 NCI_CGAP_Kid12 Homo sapiens cDNA clone
               IMAGE:2399712 3', mRNA sequence.
1636_g_at      ABL is the cellular homolog proto-oncogene of Abelson's murine
               leukemia virus and is associated with the t9:22 chromosomal
               translocation with the BCR gene in chronic myelogenous and acute
               lymphoblastic leukemia; alternative splicing using exon 1a
39730_at       p150 protein (AA 1-1130)
37006_at       Source: [illegible]23c07.x1 Soares_Dieckgraefe_colon_NHUC Homo
               sapiens cDNA clone IMAGE:2351436 3', mRNA sequence.
33131_at       Source: H.sapiens mRNA for SOX-4 protein.
36031_at       Source: Homo sapiens mRNA for p33, complete cds.
38968_at       This protein preferentially associates with activated form of
               Btk(Sab).
40202_at       three-times repeated zinc finger motif
38119_at       Source: Human mRNA for erythrocyte membrane sialoglycoprotein
               beta (glycophorin C).
36601_at       vinculin
32260_at       Source: H.sapiens mRNA for major astrocytic phosphoprotein PEA-15.
34550_at       Source: Human mRNA for D-1 dopamine receptor.
37399_at       Source: Human mRNA for KIAA0119 gene, complete cds.
38994_at       similar to product encoded by GenBank Accession Number AB004903
1583_at        Source: Human tumor necrosis factor receptor mRNA, complete cds.
1461_at        Source: Homo sapiens MAD-3 mRNA encoding IkB-like activity,
               complete cds.
33885_at       Source: Homo sapiens mRNA for KIAA0907 protein, complete cds.
34889_at       Source: zk81f02.s1 Soares_pregnant_uterus_NbHPU Homo sapiens
               cDNA clone IMAGE:489243 3', mRNA sequence.
40790_at       basic helix-loop-helix protein
38276_at       Source: Human I kappa B epsilon (IkBe) mRNA, complete cds.
36543_at       tissue factor versions 1 and 2 precursor
36591_at       Source: Human HALPHA44 gene for alpha-tubulin, exons 1-3.
37600_at       Source: Human extracellular matrix protein 1 mRNA, complete cds.
675_at         Interferon-inducible protein 9-27
1295_at        putative
37732_at       Source: Homo sapiens mRNA; cDNA DKFZp564E1922 (from clone
               DKFZp564E1922).
669_s_at       Source: Homo sapiens interferon regulatory factor 1 gene, complete
               cds.
38313_at       Source: Homo sapiens mRNA for KIAA1062 protein, partial cds.
35256_at       Source: Homo sapiens mRNA; cDNA DKFZp434F152 (from clone
               DKFZp434F152).
35688_g_at     Source: H.sapiens MTCP1 gene, exons 2A to 7 (and joined mRNA).
32139_at       Source: H.sapiens mRNA for ZNF185 gene.
40296_at       match: proteins O43895 Q95333 Q07825 O15250 O54975
149_at         DEAD-box family member; contains DECD-box; similar to rat liver
               nuclear protein p47 (PIR Accession Number A42881) and D.
               melanogaster DEAD-box RNA helicase WM6 (PIR Accession Number
               S51601)
32251_at       Source: zl25h05.s1 Soares_pregnant_uterus_NbHPU Homo sapiens
               cDNA clone IMAGE:503001 3', mRNA sequence.
37014_at       p78 protein
1272_at        Source: Human translation initiation factor eIF-2 gamma subunit
               mRNA, complete cds.
40771_at       match: proteins: Sw:P26038 Tr:O35763 Sw:P26041 Sw:P26042
               Sw:P26044 Sw:P35241 Sw:P26043 Sw:P15311 Sw:P31976
               Sw:P26040 Tr:Q26520 Tr:Q24788 Tr:Q24796 Tr:Q94815
32941_at       Source: Homo sapiens DNA-binding protein mRNA, complete cds.
37001_at       Ca2-activated
37421_f_at     Source: Human DNA sequence from clone RP3-377H14 on
               chromosome 6p21.32-22.1, complete sequence.
39755_at       match: proteins: Sw:P17861 Tr:O35426
33936_at       Source: Homo sapiens DNA for galactocerebrosidase, exon 17 and
               complete cds.
40370_f_at     Source: Human lymphocyte antigen (HLA-G1) mRNA, complete cds.
32788_at       This giant protein comprises an amino-terminal 700-residue
               leucine-rich region, four RanBP1-homologous domains, eight
               zinc-finger motifs similar to those of NUP153 and a carboxy
               terminus with high homology to cyclophilin.
34990_at       isolated by yeast two-hybrid screening
36927_at       The submitters designated this product as GS3686
2031_s_at      Source: Human wild-type p53 activated fragment-1 (WAF1) mRNA,
               complete cds.
40518_at       precursor polypeptide (AA -23 to 1120)
38336_at       hj06791 cDNA clone for KIAA1013 has a 4-bp deletion at position
               between 1855 and 1860 of the sequence of KIAA1013.
39059_at       D7SR
547_s_at       NGFI-B/nur77 beta-type transcription factor homolog
36048_at       Source: Homo sapiens HRIHFB2436 mRNA, partial cds.
33061_at       Source: Homo sapiens C16orf3 large protein mRNA, complete cds.
40712_at       CD156; ADAM8; MS2
39290_?_at     Source: 44c1 Human retina cDNA randomly primed sublibrary Homo
               sapiens cDNA, mRNA sequence.
35408_i_at     Source: Human mRNA for zinc finger protein (clone 431).
36103_at       Source: Homo sapiens gene for LD78 alpha precursor, complete cds.
EXAMPLE VIII

Discriminant Analysis of Pre-B ALL Cohort Data to Discriminate Between
Remission and Failure and Among Various Karyotypes


Classification tasks and the class labels 

We used supervised learning methods to discriminate between positive 
and negative outcomes (Remission (CCR) vs. Failure) and to discriminate 
among various karyotypes. The outcome statistics for the 167 member "training 
set" derived from the 254 member pre-B ALL cohort are shown in Table 22. 
Table 22. Class Labels for Outcome Prediction

Label  Class Name  # of Samples in the Class
1      CCR         73
2      Failure     94




To discriminate among the various karyotypes, we considered three 
different classifications of the karyotypes (Table 23). 



Table 23. Class Labels for Karyotype Discrimination

No.  Karyotype     Class Labels  # of Samples in the Class
1    T(12;21)      1             24
2    T(4;11)       2             14
3    T(1;19)       3             21
4    T(9;22)       4             10
5    Hyperdiploid  5             17
6    Hypodiploid   4             2
7    Normal        6             65
8    Unknown       7             14



Data preprocessing 

The analysis was performed on the data set comprising the 167 training
cases. We first eliminated 54 of the 67 control genes (those with accession IDs
starting with the AFFX prefix), and then eliminated those genes with all calls
"Absent" for all 167 training cases. With these genes removed from the original
12625, we were left with 8582 genes. In addition, a natural log transformation
was performed on the 8582 x 167 matrix of the gene expression values prior to
further analysis.
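
The preprocessing steps above can be sketched as follows. The data layout is an assumption made for illustration: `expr` maps each probe set ID to its list of expression values across samples, `calls` maps it to the matching list of detection calls, and the code "A" is assumed to denote an "Absent" call. Expression values are assumed to be positive so that the natural log is defined.

```python
import math

def preprocess(expr, calls):
    """Drop AFFX control probe sets and all-Absent genes, then log-transform.

    expr  : dict probe_id -> list of positive expression values (hypothetical layout)
    calls : dict probe_id -> list of detection calls, "A" assumed to mean Absent
    """
    kept = {}
    for probe, values in expr.items():
        if probe.startswith("AFFX"):
            continue  # control probe set
        if all(c == "A" for c in calls[probe]):
            continue  # called Absent in every training case
        kept[probe] = [math.log(v) for v in values]  # natural log
    return kept
```

Applied to the 12625 probe sets and 167 training cases described above, a filter of this kind would yield the reduced expression matrix used in the subsequent ranking step.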


Ranking genes 

The 8582 genes are ranked by two methods based on ANOVA for each
classification exercise. Method 1 ranks the genes in terms of the F-test statistic
values. Method 2 assigns a rank to each gene in terms of the number of pairs of
classes between which the gene's expression value differs significantly. Note
that for the binary classification problem (remission vs. failure), only Method 1
is applicable.
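
Method 2 can be illustrated with a short sketch. The form of the pairwise test is an assumption: here each pair of classes is compared with a pooled-variance two-sample t statistic, and a pair counts as significantly different when the absolute t value exceeds a caller-supplied critical value `t_crit` (a hypothetical parameter standing in for the chosen significance level). Groups are assumed to have at least two samples and nonzero pooled variance.

```python
from itertools import combinations
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def sig_pair_count(groups, t_crit):
    """Number of class pairs whose means differ by more than t_crit (Method 2 score)."""
    return sum(1 for a, b in combinations(groups, 2)
               if abs(pooled_t(a, b)) > t_crit)
```

Genes would then be ranked by this count in descending order; with only two classes there is a single pair, which is why only Method 1 is informative for remission vs. failure.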
Discriminating among the classes

An optimal subset of prediction genes is further selected from the top 200
genes of a given ranked gene list through the use of stepwise discriminant
analysis. Then the classes are discriminated using linear discriminant
analysis. The classification error rate is estimated through the leave-one-out
cross validation (LOOCV) procedure. A visualization of the class separation for
each classification is produced with canonical discriminant analysis.
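
The leave-one-out procedure can be sketched as below. To keep the example free of matrix algebra, a nearest-centroid rule is swapped in for linear discriminant analysis; it is a stand-in only, not the classifier used in this example, and the function names are hypothetical. Any classifier with the same `predict(train_x, train_y, x)` signature could be substituted.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training centroid is closest (squared distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        c = centroid([r for r, y in zip(train_x, train_y) if y == label])
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def loocv_error(xs, ys, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and count misclassifications."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)
```

The returned fraction is the LOOCV error rate; one minus this value corresponds to an overall success rate of the kind reported for the predictors in this example.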

Discrimination between Remission and Failure 

The one way ANOVA (F-test, which is equivalent to the two-sample t-test
in this case) was performed for each of the 8582 pre-selected genes, and all of
these genes were then ranked in terms of the p-value of the F-test. The numbers
of discriminating genes significant at the 0.05 and 0.01 levels are 493 and 108,
respectively. The top 20 significant discriminating genes are tabulated in
Table 24. An optimal subset of discriminating genes was also selected from the
top 200 genes using stepwise discriminant analysis. The number one significant
prediction gene in both the ranked gene list and the optimal subset of prediction
genes is 38652_at, hypothetical protein FLJ20154, corresponding to
OPAL1/GO.

The optimal subset of discriminating genes was utilized with linear
discriminant analysis to predict Remission (CCR) vs. Failure in the training
set of 167 cases. The success rate of the predictor is estimated in three ways:
Resubstitution, LOOCV with Fold Independent prediction genes, and LOOCV with
Fold dependent prediction genes; the results are listed in Table 25.
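
The difference between the two LOOCV estimates can be sketched as follows, under one reading of the terminology: "fold independent" prediction genes are selected once from all cases, whereas "fold dependent" prediction genes are re-selected inside each fold from the remaining cases only. This reading, the two-class setting, the mean-difference gene score, and the nearest-class-mean rule (used in place of stepwise and linear discriminant analysis) are all assumptions made for illustration.

```python
def class_means(xs, ys, genes):
    """Per-class mean of each selected gene."""
    means = {}
    for label in sorted(set(ys)):
        rows = [x for x, y in zip(xs, ys) if y == label]
        means[label] = [sum(r[j] for r in rows) / len(rows) for j in genes]
    return means

def select_genes(xs, ys, k):
    """Rank genes by absolute difference of the two class means; keep the top k."""
    labels = sorted(set(ys))
    m = class_means(xs, ys, range(len(xs[0])))
    a, b = m[labels[0]], m[labels[1]]
    ranked = sorted(range(len(xs[0])), key=lambda j: -abs(a[j] - b[j]))
    return ranked[:k]

def predict(train_x, train_y, x, genes):
    """Nearest class mean over the selected genes."""
    m = class_means(train_x, train_y, genes)
    return min(m, key=lambda lab: sum((x[j] - c) ** 2 for j, c in zip(genes, m[lab])))

def loocv_success(xs, ys, k, fold_dependent):
    """LOOCV success rate with genes fixed up front or re-selected per fold."""
    fixed = select_genes(xs, ys, k)  # uses every case, including the held-out one
    hits = 0
    for i in range(len(xs)):
        tx, ty = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        genes = select_genes(tx, ty, k) if fold_dependent else fixed
        hits += predict(tx, ty, xs[i], genes) == ys[i]
    return hits / len(xs)
```

Because the fold-independent variant lets the held-out case influence gene selection, it would generally be expected to give a more optimistic estimate than the fold-dependent variant.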
Discrimination among various karyotypes

The one way ANOVA (F-test) and the pair-wise comparison t-test were
performed for each of the 8582 pre-selected genes for the karyotype
classification problem. Next, all genes were ranked based on the two methods
described above under "Ranking genes". The top 20 genes in each of the ranked
gene lists are listed in Tables 26 and 27. The tables also list the values of the
statistic F and the number of pairs of classes between which the gene expression
value differs at significance level α=0.10, which is labeled as SIG#. An optimal
subset of discriminating genes for each of the classes was selected from the top
200 genes with stepwise discriminant analysis.

Each optimal subset of discriminating genes was utilized with linear
discriminant analysis to predict the corresponding classes in the training set
of 167 cases. The success rate of the predictor is estimated in the same way as
described above for outcome prediction, and the results are listed in Table 28.
Table 26. Top significant discriminating genes for karyotype.
Genes selected by Method 1

Rank  Stepwise  F        p-value  Sig#  Probe Set  Probe Set Description
1     1         25.8207  0.00000  8     33355_at   Homo sapiens mRNA; cDNA DKFZp586J2118
                                                   (from clone DKFZp586J2118)
2     1         22.6173  0.00000  6     36452_at   synaptopodin
3     1         20.7497  0.00000  11    40272_at   collapsin response mediator protein 1
4     1         20.5471  0.00000  13    34335_at   ephrin-B2
5     0         20.1257  0.00000  9     32063_at   pre-B-cell leukemia transcription factor 1
6     0         18.1686  0.00000  10    38285_at   crystallin, mu
7     0         17.4124  0.00000  14    1325_at    MAD (mothers against decapentaplegic,
                                                   Drosophila) homolog 1
8     0         16.4965  0.00000  9     41097_at   telomeric repeat binding factor 2
9     0         16.1843  0.00000  15    37280_at   MAD (mothers against decapentaplegic,
                                                   Drosophila) homolog 1
10    0         15.8108  0.00000  6     35362_at   myosin X
11    1         15.7074  0.00000  15    33412_at   lectin, galactoside-binding, soluble, 1
                                                   (galectin 1)
12    0         15.4828  0.00000  14    35940_at   POU domain, class 4, transcription factor 1
13    1         15.0498  0.00000  11    1081_at    ornithine decarboxylase 1
14    0         14.3251  0.00000  12    717_at     GS3955 protein
15    1         14.2303  0.00000  16    40570_at   forkhead box O1A (rhabdomyosarcoma)
16    0         14.0783  0.00000  14    32977_at   chromosome 6 open reading frame 32
17    0         14.0752  0.00000  15    37680_at   A kinase (PRKA) anchor protein (gravin) 12
18    0         13.9742  0.00000  12    854_at     B lymphoid tyrosine kinase
19    0         13.8677  0.00000  6     1077_at    recombination activating gene 1
20    0         13.7766  0.00000  17    37343_at   inositol 1,4,5-triphosphate receptor, type 3
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Table 27. Top significant discriminating genes karyotype 



Genes selected by Method 2 



Kaiuc 


Step- 
wise 


17 

r 


p- value 


big w 


KroDe oei 


Probe Set Description 


1 


u 


13./ /oo 


u.uuuuu 


1 / 


3 /3*l3_ai 


inositol 1 ,4, 5 -triphosphate 
receptor, type 3 


z 


0 


13.43 1 3 


U.UUUUU 




1 CO o« 


inositol 1 ,4,5-triphosphate 

receptor, type 3 


3 


1 
1 


1 3.U /Oj 


U.UUUUU 


1 / 


n^iQ of 
3 / j3y ai 


KajvjL/o-iiKe gene 


4


0


14.2303


0.00000


16


40570_at


forkhead box O1A
(rhabdomyosarcoma)


5 


1 


13.0270 


0.00000 


16 


307_at


arachidonate 5-lipoxygenase 


o 


0 


1 z.y /ZD 


0.00000


lo 


3o34U_ai 


huntingtin interacting protein- 
1 -related 


7 


0 


Iz. / /Zh 


0.00000


lo 


3zoz /_ai 


related RAS viral (r-ras) 
oncogene homolog 2 


Q 
O 


u 


1 1 .oyoi 


0.00000


1 o 


3oj3o_ai 


schwannomin-interacting 
protein 1 


9 


0 


11.4521 


0.00000 


16 


32554 s at 


transducin (beta)-like 1 


10 


0 


10.1yo3 


0.00000


lo 


3oodU at 


cyclin D2


11 


0 


10.1845 


0.00000 


16 


38968_at 


SH3 -domain binding protein 5 
(BTK-associated) 


12 


0 


10.0070 


0.00000 


16 


38518_at 


sex comb on midleg 
(Drosophila)-like 2


13 


0 


8.6339 


0.00000 


16 


37981 at 


drebrin 1 


14 


0 


7.6949 


0.00000 


16 


35794 at 


KIAA0942 protein 


15 


0 


16.1843 


0.00000 


15 


37280_at 


MAD (mothers against 
decapentaplegic, Drosophila)
homolog 1 


16 


1 


15.7074


0.00000 


15 


33412_at 


lectin, galactoside-binding, 
soluble, 1 (galectin 1) 


17 


0 


14.0752 


0.00000 


15 


37680_at 


A kinase (PRKA) anchor 
protein (gravin) 12 


18 


0 


12.8180 


0.00000 


15 


675_at 


interferon induced 
transmembrane protein 1 (9- 
27) 


19 


0 


11.9668 


0.00000 


15 


39929 at 


KIAA0922 protein 


20 


1 


11.4160 


0.00000 


15 


38748_at 


adenosine deaminase, RNA- 
specific, B1 (homolog of rat
RED1)
Table 28. Estimates of Prediction Success Rates for Karyotype Discrimination

Task                      Estimation method   Number of misclassifications   Overall Success Rate

Gene selection method 1   Resubstitution                    9                      0.9461
                          FIPG LOOCV                       28                      0.8323
                          FDPG LOOCV                       58                      0.6527

Gene selection method 2   Resubstitution                   10                      0.9401
                          FIPG LOOCV                       30                      0.8204
                          FDPG LOOCV                       55                      0.6707
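By way of illustration only, the success rates reported in Table 28 are consistent with one minus the fraction of misclassified cases, provided the number of cases is taken to be 167. That figure is an assumption borrowed from the training set size stated in Example IX; this portion of the text does not state the denominator. The function name below is hypothetical.

```python
# Assumption: 167 cases, the training-set size given in Example IX.
N_CASES = 167


def success_rate(misclassified: int, n_cases: int = N_CASES) -> float:
    """Return the fraction of cases classified correctly."""
    return 1.0 - misclassified / n_cases


# Misclassification counts as listed in Table 28.
table_28 = {
    ("method 1", "Resubstitution"): 9,
    ("method 1", "FIPG LOOCV"): 28,
    ("method 1", "FDPG LOOCV"): 58,
    ("method 2", "Resubstitution"): 10,
    ("method 2", "FIPG LOOCV"): 30,
    ("method 2", "FDPG LOOCV"): 55,
}

for (task, estimator), errors in table_28.items():
    print(f"{task:9s} {estimator:15s} {errors:3d} {success_rate(errors):.4f}")
```

Under this assumption each computed value rounds to the corresponding tabulated rate.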



Example IX.

Uniformly Significant Genes that Are Correlated with CCR vs. Failure

The three data sets derived from the retrospective, statistically designed
254 member Pre-B data set were analyzed for their association with outcome:
the 167 member training set, the 87 member test set and the overall 254 member
data set. Three measures were used: ROC accuracy A, F-test statistic and
TNoM. Table 29 shows a list of genes correlated with outcome, with the ranks
determined by these different measures in the different data sets.
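As a non-limiting sketch, two of the three measures can be computed for a single gene as follows. It is assumed here, without confirmation from this passage, that ROC accuracy A corresponds to the area under the ROC curve and that TNoM is the threshold number of misclassifications, i.e. the fewest errors made by any single-threshold rule. The function names and the 0/1 label coding are hypothetical.

```python
def roc_auc(values, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half).

    The result is folded to max(a, 1 - a) so that a gene is scored the same
    whether high or low expression marks the positive class.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1.0 - auc)


def tnom(values, labels):
    """Fewest misclassifications over all single-threshold rules."""
    order = sorted(zip(values, labels))
    n = len(order)
    total_pos = sum(y for _, y in order)
    # A threshold outside the data range assigns every case to one class.
    best = min(total_pos, n - total_pos)
    pos_left = 0
    for i, (v, y) in enumerate(order, start=1):
        pos_left += y
        if i < n and order[i][0] == v:
            continue  # a threshold cannot separate tied values
        neg_right = (n - total_pos) - (i - pos_left)
        errors = pos_left + neg_right      # rule: low = negative, high = positive
        best = min(best, errors, n - errors)  # and the opposite orientation
    return best
```

Genes would then be ranked by descending A or ascending TNoM; the F-test statistic is the usual two-group analysis-of-variance ratio.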

Two genes were consistently significant in both the training and test sets,
and they are the number one and number two significant genes in the overall data
set. The two genes are 39418_at, DKFZP564M182 protein (PBK1) and
41819_at, FYN-binding protein (FYB-120/130). FYN is a tyrosine kinase found
in fibroblasts and T lymphocytes (Popescu et al., Oncogene 1(4):449-451
(1987)).

Unexpectedly, although OPAL1/GO was the most significant gene in the
training data set, it was a much less significant gene in the test data set. Indeed,
most of the significant genes in the training set, like OPAL1/GO, became less
significant in the test set. The fact that most genes that did well in the training set
did poorly in the test set lends support to our hypothesis that the test set's
composition differed significantly from that of the training set. We therefore
sought to increase the robustness of this statistical analysis.
Re-sampling training and test data sets

Our goal was to identify genes that are significant irrespective of the
data set. One way to get a stable (robust) list of genes that are highly correlated
with the distinction of CCR vs. Failure is through the use of a random
re-sampling (bootstrap) procedure. We randomly divided the overall data set into
training and test sets 172 times. The numbers of CCRs and Failures in the
training set were fixed to agree with the original training set (i.e., 73 CCRs and
94 Failures). Each time, the genes were ranked in the same way as in Table 1. That
is, we produced 172 tables like Table 29 for the 172 different training and test
sets.

We found that the gene rankings in the two data sets (training and test,
randomly resampled each time) are typically quite different. However, in
most runs, the two genes 39418_at (PBK1) and 41819_at (FYN-binding
protein) were consistently significant in both the random training and test sets.
We called these two genes the uniformly most significant genes. OPAL1/GO
(38652_at) also consistently shows significance.

Generation of a robust gene list (a list of uniformly significant genes)

The following rule was used to assign a quantitative value to each gene
to evaluate the extent to which the gene is uniformly significant: in each training and
test set, the genes are ranked by the three measures. After 172 resamplings, each
gene has 172 ranks on each of the three measures in each of the two data sets. We
calculate the mean of the 172 ranks of each gene. We then sort the genes on
the mean ranks. In this way we get a robust gene list corresponding to each of
the three measures in each of the two data sets.
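The resampling and mean-rank rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: it ranks genes by a two-group F statistic only (one of the three measures), it codes CCR as 1 and Failure as 0, and all function and parameter names are hypothetical. For the data described here the call would use 73 CCRs and 94 Failures in each training set and 172 resamplings.

```python
import numpy as np


def f_statistic(x, y):
    """Two-group F statistic per gene (rows = genes, columns = samples)."""
    g0, g1 = x[:, y == 0], x[:, y == 1]
    n0, n1 = g0.shape[1], g1.shape[1]
    m0, m1, grand = g0.mean(axis=1), g1.mean(axis=1), x.mean(axis=1)
    between = n0 * (m0 - grand) ** 2 + n1 * (m1 - grand) ** 2  # 1 d.f.
    within = ((g0 - m0[:, None]) ** 2).sum(axis=1) \
        + ((g1 - m1[:, None]) ** 2).sum(axis=1)
    return between / (within / (n0 + n1 - 2))


def ranks_from_scores(scores):
    """Rank 1 is assigned to the largest score."""
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks


def mean_ranks(x, y, n_train_ccr, n_train_fail, n_resamples, seed=0):
    """Average each gene's rank over random stratified train/test splits."""
    rng = np.random.default_rng(seed)
    ccr = np.flatnonzero(y == 1)    # assumed coding: 1 = CCR
    fail = np.flatnonzero(y == 0)   # assumed coding: 0 = Failure
    train_sum = np.zeros(x.shape[0])
    test_sum = np.zeros(x.shape[0])
    for _ in range(n_resamples):
        train = np.concatenate([
            rng.choice(ccr, n_train_ccr, replace=False),
            rng.choice(fail, n_train_fail, replace=False),
        ])
        mask = np.zeros(y.size, dtype=bool)
        mask[train] = True          # remaining samples form the test set
        train_sum += ranks_from_scores(f_statistic(x[:, mask], y[mask]))
        test_sum += ranks_from_scores(f_statistic(x[:, ~mask], y[~mask]))
    return train_sum / n_resamples, test_sum / n_resamples
```

Sorting the genes by the returned mean ranks yields one robust list for the training sets and one for the test sets; repeating the calculation with the other two measures would give the remaining lists.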

The top 100 genes in the robust gene list are presented in Table 30 with
the robust ranks determined by the three different measures. We found that the
ranks in the training set and test set closely agree with each other and with the rank
determined by the overall data set. The two most uniformly significant genes
(39418_at and 41819_at) were ranked first and second. OPAL1/GO survives in
this analysis and had good average ranks on the three measures, but was only
about 10th best overall.






Table 29. Ranks of Significant Genes Generated in Original Training, Test and
Overall Data Sets



In Training Data Set


In Test Data Set


In Overall Data Set


Accession
#


Gene Description


A
Rank


F
Rank


TNoM
Rank


A
Rank


F
Rank


TNoM
Rank


A
Rank


F
Rank


TNoM
Rank






1 


1 


1 


7695 


7493 


7251 


10 


7 


6 




38652_at


hypothetical
protein
FLJ20154


2 


2 


54 


60 


122 


94 


1 


1 


7 


39418_at 


DKFZP564M182 
protein 


3 


5 


22 


3757 


3530 


4708 


14 


17 


32 


41478_at


Homo sapiens
cDNA FLJ30991
fis, clone
HLUNG1000041


4 


14 


32 


8337 


8425 


1894 


132 


253 


266 


37674_at 


aminolevulinate, 
delta-, synthase 1 


5 


6 


10 


4353 


4210 


5827 


31 


23 


83 


38270_at


poly (ADP-
ribose)
glycohydrolase


6 


3 


49 


2354 


818 


2966 


12 


2 


81 


38119_at 


glycophorin C 
(Gerbich blood 
group) 


7 


4


35 


1026 


945 


2202 


6 


3 


65 


671_at 


secreted protein, 
acidic, cysteine- 
rich (osteonectin) 


8 


20 


12 


1702 


933 


1418 


8 


12 


66 


1126_s_at 


Homo sapiens 
CD44 isoform 
RC (CD44) 
mRNA, complete 
cds 


9 


7 


38 


3684 


7525 


5011 


25 


78 


143 


31527_at


ribosomal 
protein S2 


10 


9 


61 


7679 


6989 


7628 


150 


166 


286 


587_at 


endothelial 
differentiation, 
sphingolipid G- 
protein-coupled 
receptor, 1 


11 


26 


45 


3263 


4366 


6960 


30 


86 


168 


36144_at 


KIAA0080 
protein 


12 


22 


63 


6526 


6224 


7633 


97 


125 


204 


625_at 


membrane 
protein of 
cholinergic 
synaptic vesicles 


13 


10 


212 


6098 


6724 


5394 


75 


93 


335 


34760_at 


KIAA0022 gene 
product 


14 


18 


143 


2541 


1713 


7043 


20 


21 


359 


36927_at 


hypothetical 
protein, 
expressed in
osteoblast 


15 


8 


17 


5147 


5142 


7971 


72 


34 


162 


35796_at 


protein tyrosine 
kinase 9-like 
(A6-related 
protein) 


16 


35 


14 


7445 


8457 


7792 


175 


205 


460 


32336_at 


aldolase A, 

fructose- 

bisphosphate 


17 


161 


74 


6925 


5891 


6648 


138 


374 


318 


33188_at 


peptidylprolyl 

isomerase 

(cyclophilin)-like 

2 


18 


109 


11 


38 


63 


104 


2 


8 


2 


41819_at 


FYN-binding 
protein (FYB- 
120/130) 


19 


56 


36 


3000 


4192 


4982 


45 


161 


139 


2062_at 


insulin-like 
growth factor 
binding protein 7 


20 


43 


124 


6998 


5801 


6770 


333 


514 


1373 


34349 at 


SEC63 protein 


21 


25 


184 


7476 


7310 


8582 


168 


175 


1219 


932_i_at 


zinc finger 
protein 91 
(HPF7, HTFIO) 


22 


198 


149 


2380 


3049 


2927 


36 


238 


80 


37748_at 


KIAA0232 gene 
product 


23 


12 


83 


3966 


8153 


4329 


115 


231 


175 


38440_s_at 


hypothetical 
protein 


24 


33 


96 


6080 


6141 


6364 



144 


119 


856 


106_at 


runt-related 
transcription 
factor 3 


25 


54 


20 


80 


90 


177 


4 


6 


3 


37343_at 


inositol 1,4,5- 
triphosphate 
receptor, type 3 


26 


59 


199 


3436 


3294 


6609 


78 


123 


316 


32703_at 


serine/threonine 
kinase 18 


27 


31 


18 


1805 


2464 


4031 


35 


36 


121 


36154_at 


KIAA0263 gene 
product 


28 


50 


48 


1479 


1275 


1931 


1520 


2214 


3445 


38111_at 


chondroitin 
sulfate 

proteoglycan 2 
(versican) 


29 


36 


5 


4225 


4623 


4966 


68 


111 


19 


1980_s_at 


non-metastatic 
cells 2, protein 
(NM23B) 
expressed in 


30 


21 


214 


4722 


4614 


6831 


87 


58 


693 


34965_at 


cystatin F 
(leukocystatin) 


31 


39 


118 


410 


385 


297 


9 


10 


11 


33412_at 


lectin, 
galactoside- 
binding, soluble, 
1 (galectin 1) 


32 


48 


159 


4699 


3446 


7359 


667 


1045 


2761 


39607_at 


myotubularin 
related protein 8 


33 


87 


677 


4246 


4880 


4929 


908 


1194 


4856 


1698_g_at 


mitogen- 
activated protein 
kinase kinase 5 


34 


41 


42 


7549 


7856 


7947 


195 


212 


119 


35322_at 


Kelch-like ECH- 
associated 
protein 1 


35 


200 


75 


2290 


4897 


5290 


53 


484 


155 


33866 at 


tropomyosin 4 


36 


23 


728 


1700 


2677 


1584 


37 


54 


149 


32623_at 


gamma- 
aminobutyric 
acid (GABA) B 
receptor, 1 


37 


38 


348 


2662 


3937 


4001 


57 


67 


1022 


35939__s_at 


POU domain, 
class 4, 
transcription 
factor 1 


38 


24 


132 


6369 


8517 


6890 


629 


371 


346 


35614_at 


transcription 
factor-like 5 
(basic helix- 
loop-helix) 


39 


15 


422 


3450 


2407 


4730 


91 


25 


417 


41656_at 


N-
myristoyltransferase 2


40 


82 


299 


5587 


5878 


5033 


215 


354 


454 


31830 s at 


smoothelin 


41 


28 


297 


4620 


2982 


5023 


140 


51 


892 


31695_g_a 
t 


regulatory solute 
carrier protein, 
family 1, 
member 1 


42 


27 


210 


2295 


3602 


1699 


67 


68 


112 


34433_at 


docking protein 
1, 62kD

(downstream of 
tyrosine kinase 
1) 


43 


67 


432 


656 


367 


3375 


16 


13 


205 


824_at 


glutathione-S- 
transferase like; 
glutathione 
transferase 
omega 


44 


53 


631 


5724 


6981 


6154 


712 


587 


2164 


40817 at 


nucleobindin 1 


45 


37 


87 


3277 


3624 


6098 


88 


81 


400 


40365_at 


guanine 
nucleotide 
binding protein 
(G protein), 
alpha 15 (Gq 
class) 


46 


321 


183 


4355 


2425 


4813 


1178 


4723 


2240 


843_at 


protein tyrosine 
phosphatase type 
IV A, member 1 


47 


29 


170 


7282 


6865 


6155 


523 


402 


583 


40821_at


S- 

adenosylhomocy 
steine hydrolase 


48 


81 


101 


8352 


6490 


3444 


308 


737 


623 


1452_at 


LIM domain 
only 4 


49 


11 


2 


2576 


5715 


3725 


54 


101 


5 


33415_at 


non-metastatic 
cells 2, protein 
(NM23B) 
expressed in 


50 


72 


311 


1693 


2506 


930 


41 


79 


313 


32629_f_a 
t 


butyrophilin, 
subfamily 3, 
member A1


51 


30 


19 


5994 


5551 


4154 


846 


652 


1057 


37147_at 


stem cell growth 
factor; 
lymphocyte 
secreted C-type 
lectin 


52 


57 


162 


6231 


6377 


8551 


232 


225 


1144 


39932_at 


Homo sapiens 
mRNA; cDNA 
DKFZp586F2224 
(from clone 
DKFZp586F2224 

) 


53 


74 


26 


1585 


1098 


2297 


47 


35 


17 


1711_at 


tumor protein 
p53-binding 
protein, 1 


54 


274 


21 


3295 


2921 


3154 


74 


278 


43 


40141 at 


cullin 4B 


55 


16 


46 


3687 


5454 


1826 


1278 


442 


252 


36537_at 


Rho-specific 
guanine 
nucleotide 
exchange factor 
p114


56 


62 


33 


5966 


5635 


7169 


220 


214 


173 


37986_at 


erythropoietin 
receptor 


57 


55 


24 


1793 


2145 


4887 


44 


50 


95 


1403_s_at 


small inducible 
cytokine A5 
(RANTES) 


58 


185 


201 


5797 


4517 


2477 


159 


331 


151 


32843_s_a 
t 


fibrillarin


59 


88 


265 


5254 


3724 


4435 


202 


170 


565 


39302 at 


desmocollin 2


60 


13 


606 


2770 


1145 


5922 


82 


11 


771 


38971_r_a 
t 


Nef-associated 
factor 1 


61 


40 


40 


5525 


6158 


6715 


245 


211 


482 


33757_f_a 
t 


pregnancy 
specific beta-1- 
glycoprotein 11


62 


286 


28 


2620 


2264 


5008 


83 


236 


142 


31472_s_a 
t 


Homo sapiens 
CD44 isoform 
RC (CD44) 
mRNA, complete 
cds 


63 


305 


318 


1023 


2872 


307 


26 


310 


154 


33637_g_a
t 


cancer/testis 
antigen 


64 


184 


190 


4452 


3255 


3517 


223 


241 


445 


207_at 


stress-induced- 
phosphoprotein 1 
(Hsp70/Hsp90-

organizing 
protein) 


65 


101 


399 


5221 


4264 


7422 


249 


206 


798 


40183_at 


coactivator- 

associated 

arginine 

methyltransferase 
-1 


66 


91 


56 


2163 


3116 


3162 


1969 


1848 


2792 


40246_at 


discs, large 
(Drosophila) 
homolog 1 


67 


19 


370 


2898 


1532 


2878 


107 


20 


260 


37280_at 


MAD (mothers 
against 

decapentaplegic, 
Drosophila) 
homolog 1 


68 


71 


911 


2538 


3388 


5963 


1680 


1549 


7785 


39221_at


leukocyte 
immunoglobulin- 
like receptor, 
subfamily B 
(with TM and 
ITIM domains), 
member 2 


69 


203 


7 


437 


440 


929 


3017 


4275 


466 


32624_at 


DKFZp566D133 
protein 


70 


60 


94 


6844 


6653 


6358 


785 


640 


425 




NO SIF seq 


71 


76 


817 


4663 


4498 


5550 


1073 


1187 


2548 


36060_at


signal 
recognition 
particle 54kD 


72 


44 


627 


2530 


2272 


6120 


113 


52 


402 


40507_at 


solute carrier 
family 2 
(facilitated 
glucose 
transporter), 
member 1 


73 


58 


307 


4991 


4702 


5083 


254 


171 


225 


32211_at


proteasome 
(prosome, 
macropain) 26S 
subunit, non- 
ATPase, 13 


74 


46 


825 


3943 


2954 


8016 


191 


70 


2586 


36500_at 


NAD(P) 

dependent 

steroid 

dehydrogenase- 
like;H105e3 


75 


264 


397 


5397 


4257 


7394 


224 


362 


572 


39865_at 


Homo sapiens 
cDNA FLJ30639 
fis, clone 
CTONG2002803 


76 


77 


104 


4288 


5778 


2331 


1055 


679 


444 


2035_s_at 


enolase 1, 
(alpha) 


77


97 


373 


2644 


2657 


5748 


94 


117 


738 


37572 at 


cholecystokinin 


78 


45 


111 


5526 


6106 


3614 


197 


201 


226 


32254_at 


vesicle- 
associated 
membrane 
protein 2 
(synaptobrevin 
2) 


79 


291 


92 


4357 


7049 


4748 


188 


790 


202 


41761_at 


TIA1 cytotoxic
granule- 
associated RNA- 
binding protein- 
like 1 


80 


242 


233 


8287 


8066 


7012 


478 


956 


1963 


36624_at 


IMP (inosine 
monophosphate) 
dehydrogenase 2 


81 


133 


240 


1388 


1748 


1871 


2911 


2910 


2622 


37263_at 


gamma-glutamyl 

hydrolase 

(conjugase, 

folylpolygamma 

glutamyl 

hydrolase) 


82 


103 


175 


2570 


3861 


4671 


112 


158 


88 


41224_at


KIAA0788 
protein 


83 


64 


250 


917 


955 


1183 


38 


26 


371 


38087_s_at 


S100 calcium-
binding protein 
A4 (calcium 
protein, 
calvasculin, 
metastasin, 
murine placental 
homolog) 


84 


129 


31 


6589 


4786 


1770 


417 


305 


13 


35669_at 


KIAA0633 
protein 


85 


212 


119 


1435 


3718 


3729 


2286 


2573 


2422 


33433_at 


DKFZP564F052 
2 protein 


86 


183 


244 


5029 


5157 


5729 


241 


394 


261 


37441 at 


lipoyltransferase 


87 


83 


228 


7786 


7738 


8485 


451 


283 


1025 


36002_at 


KIAA1012 
protein 


88 


120 


548 


7750 


7722 


7015 


515 


548 


1968 


36678 at 


transgelin 2 


89 


42 


139 


1062 


926 


163 


32 


18 


15 


36129_at 


KIAA0397 gene 
product 


90 


34 


200 


259 


1166 


25 


15 


19 


10 


32724_at 


phytanoyl-CoA 
hydroxylase 
(Refsum disease) 


91 


65 


57 



4461 


4427 


4570 


176 


159 


809 


40435_at


solute carrier 
family 25 
(mitochondrial 
carrier; adenine 
nucleotide 
translocator), 
member 6 


92 


132 


68 


2452 


3105 


1473 


95 


163 


18 


1923 at 


cyclin C 


93 


70 


142 


6343 


7528 


7031 


860 


689 


719 


36835_at 


protein kinase C- 
like2 


94 


157 


103 


7459 


4945 


3449 


738 


1513 


1241 


1473_s_at 


v-myb avian 
myeloblastosis 
viral oncogene 
homolog 


95 


158 


410 


585 


1147 


217 


3710 


3944 


2837 


41060 at 


cyclin E1


96 


240 


277 


6070 


4715 


4629 


279 


419 


820 


40859_at 


Homo sapiens 
mRNA; cDNA 
DKFZp762G207 
(from clone 
DKFZp762G207) 


97 


190 


9 


8035 


6314 


5815 


574 


560 


542 


38134_at 


pleiomorphic 
adenoma gene 1 


98 


32 


235 


2988 


3846 


4106 


145 


55 


515 


36783_f_at 


Krueppel-related 
zinc finger 
protein 


99 


259 


437 


5264 


5003 


4852 


274 


443 


1646 


1062^_at 


interleukin 10 
receptor, alpha 


100 


227 


823 


2199 


1173 


4045 


111 


122 


1035 


36207_at 


SEC14(S. 
cerevisiae)-like 1 



= AFFX-HUMGAPDH/M33197_M_at 
Table 30. Lists of Most Uniformly Significant Genes



In Training Data Set 


In Test Data Set 


In Overall Data Set 


Accession 

# 


Gene 
Description 


A 
Rank 


F 

Rank 


TNoM 
Rank 


A 
Rank 


F 

Rank 


TNoM
Rank 


A 
Rank 


F 

Rank 


TNoM 
Rank 


1 


1 


6 


1 


1 


2 


1 


1 


7 


39418_at 


DKFZP564M1 
82 protein 


2 


8 


2 


3 


8 


1 


2 


8 


2 


41819 at 


FYN-binding
protein (FYB- 
120/130) 


'3 
J 


A 
H 




£. 




zu 




c 


HZ 


37981 at 


drebrin 1 


4 


2 


1 


4 


5 


3 


5 


4 


1 


577_at


midkine
(neurite
growth-
promoting
factor 2)


5 


5 


5 


5 


9 


5 


4 


6 


3 


37343_at 


inositol 1,4,5- 
triphosphate 
receptor, type 
3 


6 


9 


44 


7 


6 


23 


7 


9 


71 


32058_at 


HNK-1 

sulfotransferas 
e 


/ 


1 A 
10 


1 A 
10 


1 A 
10 


iz 


12 




1 A 


1 1 


33412_at 


lectin,
galactoside-
binding,
soluble, 1
(galectin 1)


8 


12 


31 


14 


20 


13 


8 


12 


66 


1126_s_at


Homo sapiens
CD44 isoform
RC (CD44)
mRNA,
complete cds


9 


6 


52 


6 


4 


46 


6 


3 


65 


671_at 


secreted 
protein, acidic, 
cysteine-rich 
(osteonectin) 


10 


13 


23 


9 


14 


15 


11 


14 


35 


jZJzf i\j I ai 


intracellular
hyaluronan-
binding protein


11 


11 


116 


18 


19 


317 


16 


13 


205 


824_at 


glutathione-S-
transferase
like;
glutathione
transferase
omega


12 


17 


9 


19 


30 


10 


15 


19 


10 


32724_at 


phytanoyl- 
CoA 

hydroxylase 

(Refsum 

disease) 


13 


7 


8 


13 


7 


18 


10 


7 


6 


38652_at 


hypothetical 

protein 

FLJ20154 


14 


22 


41 


15 


27 


39 


13 


24 


40 


36331_at


Homo sapiens 
mRNA; cDNA 
DKFZp586C0 
91 (from clone 
DKFZp586C0 
91) 


15 


19 


30 


8 


13 


24 


14 


17 


32 


41478_at 


Homo sapiens 
cDNA 

FLJ30991 fis, 
clone 

HLUNG 10000 
41 


16 


3 


117 


11 


2 


128 


12 


2 


81 


38119_at 


glycophorin C 
(Gerbich blood 
group) 


17 


24 


417 


34 


28 


401 


20 


21 


359 


36927_at 


hypothetical 
protein, 
expressed in 
osteoblast 


18 


38 


81 


27 


49 


71 


18 


33 


53 


35145_at 


MAX binding 
protein 


19 


248 


122 


52 


414 


91 


26 


310 


154 


33637_g_a 
t 


cancer/testis 
antigen 


20 


15 


186 


92 


71 


558 


38 


26 


371 


38087_s_at 


SI 00 calcium- 
binding protein 
A4 (calcium 
protein, 
calvasculin, 
metastasin, 
murine 
placental 
homolog) 


21 


104 


643 


23 


118 


275 


28 


120 


1044 


36576_at 


H2A histone 
family, 
member Y 


22 


31 


64 


20 


18 


75 


24 


31 


62 


40523_at 


hepatocyte 
nuclear factor
3, beta 


23 


40 


12 


12 


21 


7 


17 


29 


12 


34332_at 


glucosamine- 

6-phosphate 

isomerase 


24 


60 


180 


16 


46 


134 


21 


59 


314 


32650_at 


neuronal 
protein 


25 


960 


21 


31 


599 


9 


19 


767 


9 


41727_at 


KIAA1007 
protein 


26 


79 


230 


47 


141 


145 


25 


78 


143 


31527_at 


ribosomal 
protein S2 


27 


83 


60 


36 


105 


55 


22 


62 


27 


38437_at 


MLN51 
protein 


28 


20 


118 


22 


15 


90 


23 


16 


122 


36524_at 


Rho guanine 
nucleotide 
exchange 
factor (GEF) 4 


29 


56 


70 


49 


90 


116 


43 


77 


165 


36081_s_at 


chromosome 
21 open 
reading frame 
18 


30 


47 


191 


37 


38 


106 


33 


41 


294 


160030_at 


growth 

hormone 

receptor 


31 


102 


146 


42 


111 


113 


30 


86 


168 


36144_at 


KIAA0080 
protein 


32 


244 


108 


87 


341 


239 


36 


238 


80 


37748_at 


KIAA0232 
gene product 


33 


26 


90 


32 


17 


141 


31 


23 


83 


38270_at 


poly (ADP- 

ribose) 

glycohydrolase 


34 


63 


132 


35 


41 


97 


37 


54 


149 


32623_at 


gamma- 
aminobutyric 
acid (GABA) 
B receptor, 1 


35 


57 


158 


30 


67 


61 


50 


69 


296 


1676_s_at


eukaryotic 
translation 
elongation 
factor 1 
gamma 


36 


165 


61 


21 


121 


50 


34 


149 


28 


38865_at 


GRB2-related 
adaptor protein 
2 


37 


28 


157 


74 


63 


171 


76 


43 


310 


324_f_at


NO_SIF_seq


38 


84 


3 


59 


119 


4 


54 


101 


5 


33415_at 


non-metastatic 
cells 2, protein 
(NM23B) 
expressed in 


39 


134 


136 


28 


80 


64 


27 


71 


156 


34171_at 


hypothetical 
protein from 
EUROIMAGE 
2021883 


40 


21 


24 


44 


23 


34 


32 


18 


15 


36129_at 


KIAA0397 
gene product 


41 


106 


29 


40 


82 


33 


56 


135 


14 


36004_at 


Homo sapiens 
cDNA 

FLJ20586 fis, 
clone 

KAT09466, 
highly similar 
to AF091453 
Homo sapiens 
NEMO protein 


42 


39 


66 


64 


68 


74 


42 


37 


94 


1189_at 


cyclin- 
dependent 
kinase 8 


43 


48 


154 


50 


51 


92 


44 


50 


95 


1403_s_at 


small 
inducible 
cytokine A5 
(RANTES) 


44 


54 


779 


56 


64 


557 


57 


67 


1022 


35939_s_at 


POU domain, 
class 4, 

transcription 
factor 1 


45 


30 


379 


67 


47 


429 


60 


38 


246 


35675_at 


vinexin beta 
(SH3- 
containing 
adaptor 
molecule- 1) 


46 


33 


26 


103 


72 


84 


77 


44 


25 


35856_r_at


glutamate 
receptor, 
ionotropic, 
kainate 1 


47 


37 


516 


55 


43 


265 


49 


40 


442 


1818_at


NO_SIF_seq


48 


197 


56 


17 


65 


19 


29 


142 


37 


35059_at 


Homo sapiens 
clone FBA1
Cri-du-chat 
region mRNA 


49 


65 


37 


71 


92 


45 


39 


53 


78 


36069_at 


KIAA0456 
protein 


50 


94 


11 


78 


156 


11 


68 


111 


19 


1980_s_at


non-metastatic 
cells 2, protein 
(NM23B) 
expressed in 


51 


81 


147 


45 


79 


63 


46 


75 


150 


32739_at 


N- 

acetylglucosa 
mine- 
phosphate 
mutase 


52 


115 


85 


51 


112 


144 


51 


114 


57 


361_at 


B-cell 

CLL/lymphom 
a9 


53 


100 


256 


39 


96 


112 


41 


79 


313 


32629_f_at 


butyrophilin, 
subfamily 3, 
member Al 


54 


189 


181 


33 


115 


76 


45 


161 


139 


2062_at


insulin-like 
growth factor 
binding protein 
7 


55 


55 


106 


29 


34 


60 


35 


36 


121 


36154_at 


KIAA0263 
gene product 


56 


88 


566 


48 


99 


291 


52 


84 


663 


32878_f_at 


Homo sapiens 
cDNA 

FLJ32819fis, 
clone 

TESTI200293 
7, weakly 
similar to 
HISTONE 
H3.2 


57 


27 


196 


97 


50 


400 


72 


34 


162 


35796_at 


protein 

tyrosine kinase 
9-like (A6- 
related protein) 


58 


41 


315 


25 


22 


198 


40 


32 


273 


39518_at 


Homo sapiens, 
clone 

MGC:9628 
IMAGE:39133 
ll,mRNA, 
complete cds 


59 


92 


33 


65 


107 


30 


58 


90 


39 


35425_at 


BarH-like 
homeobox 2 


60 


32 


264 


114 


76 


216 


73 


42 


622 


143_s_at


TAF5 RNA 
polymerase II, 
TATA box 
binding protein 
(TBP)- 
associated 
factor, 100kD


61 


91 


59 


26 


52 


28 


55 


85 


52 


34238_at 


immunoglobuli 
n superfamily, 
member 1 


62 


525 


194 


63 


480 


179 


53 


484 


155 


33866 at 


tropomyosin 4 


63 


80 


513 


75 


120 


579 


94 


117 


738 


37572_at 


cholecystokini 
n 


64 


34 


459 


70 


53 


336 


80 


49 


1089 


37961_at 


phosphoinositi 

de-3-kinase, 

regulatory 

subunit, 

polypeptide 3 
(p55, gamma) 


65 


67 


1046 


94 


97 


610 


92 


95 


1403 


35201_at 


heterogeneous 
nuclear 

ribonucleoprot 
ein L 


66 


49 


140 


126 


124 


99 


93 


83 


135 


1255_g_at 


guanylate 
cyclase 
activator 1A
(retina) 


67 


62 


67 


95 


62 


88 


63 


56 


54 


35368_at 


zinc finger 
protein 207 


68 


259 


25 


122 


345 


48 


74 


278 


43 


40141 at 


cullin 4B 


69 


29 


45 


98 


56 


100 


59 


27 


82 


38124_at 


midkine 
(neurite 
growth- 
promoting 
factor 2) 


70 


16 


43 


61 


11 


115 


70 


15 


44 


40617_at 


hypothetical 

protein 

FLJ20274 


71 


35 


1074 


62 


33 


703 


61 


30 


1527 


38970_s_a 
t 


Nef-associated 
factor 1 


72 


42 


84 


41 


25 


65 


48 


28 


84 


38684_at 


ATPase, Ca++
transporting, 
type 2C, 
member 1 


73 


50 


207 


68 


37 


180 


66 


47 


283 


41535_at 


CDK2- 
associated 
protein 1 


74 


103 


240 


171 


226 


228 


78 


123 


316 


32703_at 


serine/threonin 
e kinase 18


75 


46 


4 


83 


32 


8 


62 


39 


4 


36295_at 


zinc finger 
protein 134 
(clone pHZ-15) 


76 


123 


988 


79 


171 


757 


64 


115 


1181 


41208 at 


S164 protein


77 


93 


394 


167 


242 


242 


103 


138 


481 


33595_r_a 
t 


recombination 
activating gene 
2 


78 


53 


22 


121 


91 


27 


86 


61 


38 


35414_s_a 
t 


jagged 1 
(Alagille 
syndrome) 


79 


132 


203 


91 


131 


168 


108 


154 


215 


31353_f_a 
t 


forkhead box 
E2 


80 


161 


16 


43 


93 


17 


69 


151 


23 


35066_g_ 
at 


fetal 

hypothetical 
protein 


81 


374 


231 


86 


428 


201 


71 


369 


247 


35784_at 


vesicle- 
associated 
membrane 
protein 3 
(cellubrevin) 


82 


240 


174 


138 


356 


129 


83 


236 


142 


31472_s_a 
t 


Homo sapiens 
CD44 isoform 
RC (CD44) 
mRNA, 
complete cds 


83 


86 


82 


84 


100 


138 


67 


68 


112 


34433_at 


docking protein 
l,62kD 

(downstream of 
tyrosine kinase 
1) 


84 


126 


151 


142 


147 


348 


104 


134 


268 


38105_at 


hypothetical 
protein 
FLJ11021 
similar to 
splicing factor, 
arginine/serine- 
rich 4 


85 


76 


76 


107 


117 


157 


129 


128 


103 


31722_at 


ribosomal 

protein L3 


86 


52 


77 


38 


31 


41 


65 


45 


51 


34104J__a 
t 


immunoglobuli 
n heavy 
constant 
gamma 3 (G3m 
marker) 


87 


69 


511 


110 


110 


475 


121 


103 


603 


41825_at 


PTEN induced 
putative kinase 
1 


88 


25 


261 


93 


29 


276 


91 


25 


417 


41656_at 


N- 

myristoyltransf 
erase 2 


89 


36 


696 


184 


77 


1393 


113 


52 


402 


40507_at 


solute carrier 
family 2 
(facilitated 
glucose 
transporter), 
member 1 


90 


122 


187 


77 


127 


117 


75 


93 


335 


34760_at 


KIAA0022 
gene product 


91 


133 


249 


54 


86 


67 


85 


129 


214 


2092_s_at


secreted 
phosphoprotein 
1 (osteopontin, 
bone 

sialoprotein I, 
early T- 

lymphocyte 
activation 1) 


92 


428 


609 


248 


604 


598 


123 


468 


859 


1160_at 


cytochrome c- 
1 


93 


137 


267 


127 


207 


256 


81 


133 


262 


37563_at 


KIAA0411
gene product 


94 


82 


243 


118 


101 


350 


79 


64 


716 


36647_at 


hypothetical 
protein 
FLJ 10326 


95 


718 


568 


174 


1053 


427 


122 


851 


661 


32841_at 


zinc finger 

protein 9 (a 

cellular 

retroviral 

nucleic acid 

binding 

protein) 


96 


237 


79 


123 


284 


51 


109 


266 


107 


33469_r_at 


complement 
factor H 
related 3 


97 


61 


13 


24 


26 


6 


47 


35 


17 


1711_at 


tumor protein 
p5 3 -binding 
protein, 1 


98 


136 


302 


46 


98 


103 


89 


137 


231 


32822_at 


solute carrier 
family 25 
(mitochondrial 
carrier; 
adenine 
nucleotide 
translocator), 
member 4 


99 


51 


19 


183 


106 


78 


116 


63 


31 


41252_s_at 


Homo sapiens 
cDNA 

FLJ30436 fis, 
clone 

BRACE20090 
37 


100 


71 


414 


53 


42 


252 


87 


58 


693 


34965_at 


cystatin F 
(leukocystatin) 
EXAMPLE X.

Threshold Independent Approach to Assessing Significance of OPAL1/GO and
OPAL1/GO-like genes

Threshold independent supervised learning algorithms (ROC and Common
Odds Ratio) were used to identify genes associated with outcome in the 167 member
pediatric ALL training set described in Example II. Data were normalized using the
Helman-Veroff algorithm. Nonhuman genes and genes with all calls being absent
were removed from the data.
The following lists of genes associated with outcome (CCR vs. FAIL) were
identified.
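
By way of illustration only, the ROC-based ranking can be sketched as follows. The sketch assumes that the ROC accuracy A is the area under the ROC curve for a single gene, folded so that A is at least 0.5, and that outcome is encoded as 1 for FAIL and 0 for CCR; the function names and the encoding are hypothetical and are not taken from the analysis described above.

```python
def roc_accuracy(values, labels):
    """Area under the ROC curve for one gene (assumed meaning of 'A').

    values: expression values, one per patient.
    labels: 1 for FAIL, 0 for CCR (an assumed encoding).
    Returns (A, low_predicts_ccr) with A = max(AUC, 1 - AUC).
    """
    fail = [v for v, y in zip(values, labels) if y == 1]
    ccr = [v for v, y in zip(values, labels) if y == 0]
    if not fail or not ccr:
        raise ValueError("both outcome classes are required")
    # Mann-Whitney form of the AUC: fraction of (FAIL, CCR) pairs in which
    # the FAIL patient has the higher value; ties count one half.
    wins = 0.0
    for f in fail:
        for c in ccr:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    auc = wins / (len(fail) * len(ccr))
    # auc > 0.5 means high expression goes with FAIL,
    # so a low expression value predicts CCR.
    return max(auc, 1.0 - auc), auc > 0.5


def rank_genes_by_roc(expression, labels):
    """expression: dict of probe id -> list of values. Best gene first."""
    scored = [(probe, *roc_accuracy(vals, labels))
              for probe, vals in expression.items()]
    return sorted(scored, key=lambda row: row[1], reverse=True)
```

No threshold on expression is chosen at any point, which is the sense in which such a score is threshold independent. The second element of each result plays the role of the asterisk used in the tables of this Example.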

Table 31. ROC Curve Approach (Threshold Independent Method 1)
Top genes ranked in terms of ROC Accuracy

Rank  A       Access#     Gene Description
1     0.7131  38652_at    hypothetical protein FLJ20154
2*    0.6905  39418_at    DKFZP564M182 protein
3     0.6667  41478_at    Homo sapiens cDNA FLJ30991 fis, clone HLUNG1000041
4*    0.6653  37674_at    aminolevulinate, delta-, synthase 1
5     0.6612  38270_at    poly (ADP-ribose) glycohydrolase
6*    0.6572  671_at      secreted protein, acidic, cysteine-rich (osteonectin)
7*    0.6546  1126_s_at   Homo sapiens CD44 isoform RC (CD44) mRNA, complete cds
8*    0.6529  38119_at    glycophorin C (Gerbich blood group)
9     0.6527  625_at      membrane protein of cholinergic synaptic vesicles
10*   0.6524  31527_at    ribosomal protein S2
11    0.6516  587_at      endothelial differentiation, sphingolipid G-protein-coupled receptor, 1
12*   0.6513  36144_at    KIAA0080 protein
13    0.6485  41819_at    FYN-binding protein (FYB-120/130)
14    0.6454  36927_at    hypothetical protein, expressed in osteoblast
15*   0.6451  34760_at    KIAA0022 gene product
16    0.6434  37748_at    KIAA0232 gene product
17    0.6433  33188_at    peptidylprolyl isomerase (cyclophilin)-like 2
18*   0.6425  32336_at    aldolase A, fructose-bisphosphate
19    0.6419  34349_at    SEC63 protein
20*   0.6418  35796_at    protein tyrosine kinase 9-like (A6-related protein)

* indicates low expression value predicts CCR
Table 32. Common Odds Ratio Approach (Threshold Independent Method 2)
Top genes ranked in terms of common odds ratio

Rank 1  Odds Ratio  Rank 2  A       Access#     Gene Description
1       3.696       1       0.7131  38652_at    hypothetical protein FLJ20154
2*      3.232       2       0.6905  39418_at    DKFZP564M182 protein
3       2.725       3       0.6667  41478_at    Homo sapiens cDNA FLJ30991 fis, clone HLUNG1000041
4*      2.696       4       0.6653  37674_at    aminolevulinate, delta-, synthase 1
5       2.592       5       0.6612  38270_at    poly (ADP-ribose) glycohydrolase
6*      2.575       6       0.6572  671_at      secreted protein, acidic, cysteine-rich (osteonectin)
7*      2.558       7       0.6546  1126_s_at   Homo sapiens CD44 isoform RC (CD44) mRNA, complete cds
8*      2.541       8       0.6529  38119_at    glycophorin C (Gerbich blood group)
9       2.522       9       0.6527  625_at      membrane protein of cholinergic synaptic vesicles
10*     2.512       12      0.6513  36144_at    KIAA0080 protein
11      2.469       11      0.6516  587_at      endothelial differentiation, sphingolipid G-protein-coupled receptor, 1
12*     2.449       10      0.6524  31527_at    ribosomal protein S2
13*     2.441       15      0.6451  34760_at    KIAA0022 gene product
14      2.426       16      0.6434  37748_at    KIAA0232 gene product
15      2.413       14      0.6454  36927_at    hypothetical protein, expressed in osteoblast
16      2.406       13      0.6485  41819_at    FYN-binding protein (FYB-120/130)
17*     2.398       18      0.6425  32336_at    aldolase A, fructose-bisphosphate
18*     2.367       24      0.6393  2062_at     insulin-like growth factor binding protein 7
19      2.363       17      0.6433  33188_at    peptidylprolyl isomerase (cyclophilin)-like 2

* indicates low expression value predicts CCR
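
The text does not spell out how the common odds ratio is computed. One plausible reading, offered here only as an assumption, is a Mantel-Haenszel estimate pooled over the 2x2 tables (high/low expression by FAIL/CCR) obtained at every candidate threshold. The sketch below implements that reading; the function name, the 1 = FAIL / 0 = CCR encoding, and the folding of the ratio to a value of at least 1 are all hypothetical.

```python
def common_odds_ratio(values, labels):
    """Mantel-Haenszel common odds ratio pooled over all thresholds.

    values: expression values for one gene; labels: 1 = FAIL, 0 = CCR.
    Returns (ratio folded to >= 1, low_predicts_ccr).
    """
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]
    n = len(values)
    num = den = 0.0
    for t in thresholds:
        a = sum(1 for v, y in zip(values, labels) if v > t and y == 1)   # high, FAIL
        b = sum(1 for v, y in zip(values, labels) if v > t and y == 0)   # high, CCR
        c = sum(1 for v, y in zip(values, labels) if v <= t and y == 1)  # low, FAIL
        d = sum(1 for v, y in zip(values, labels) if v <= t and y == 0)  # low, CCR
        num += a * d / n
        den += b * c / n
    if num == 0.0 and den == 0.0:
        return 1.0, False          # no usable threshold
    if den == 0.0:
        return float("inf"), True  # high expression always goes with FAIL
    if num == 0.0:
        return float("inf"), False
    ratio = num / den
    return max(ratio, 1.0 / ratio), ratio > 1.0
```

Because every threshold contributes to the pooled estimate, no single cut point has to be selected, which matches the "threshold independent" label. Whether this is the estimator actually used to produce Table 32 cannot be confirmed from the text.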
Table 33. Comparison between several gene lists

Rank 1  A       Rank 2  Odds Ratio  Rank 3  F       p-value  Access#
1       0.7131  1       3.696       1       23.327  0        38652_at
2*      0.6905  2       3.232       2       14.964  0.00016  39418_at
3       0.6667  3       2.725       5       13.543  0.00032  41478_at
4*      0.6653  4       2.696       14      10.31   0.00159  37674_at
5       0.6612  5       2.592       6       13.314  0.00035  38270_at
6*      0.6572  6       2.575       4       13.886  0.00027  671_at
7*      0.6546  7       2.558       20      10.037  0.00183  1126_s_at
8*      0.6529  8       2.541       3       14.874  0.00016  38119_at
9       0.6527  9       2.522       22      9.958   0.0019   625_at
10*     0.6524  12      2.449       7       13.178  0.00038  31527_at
11      0.6516  11      2.469       9       12.544  0.00052  587_at
12*     0.6513  10      2.512       26      9.759   0.00211  36144_at
13      0.6485  16      2.406       109     7.091   0.00851  41819_at
14      0.6454  15      2.413       18      10.16   0.00172  36927_at
15*     0.6451  13      2.441       10      10.867  0.0012   34760_at
16      0.6434  14      2.426       198     5.68    0.0183   37748_at
17      0.6433  19      2.363       161     6.039   0.01503  33188_at
18*     0.6425  17      2.398       35      9.335   0.00262  32336_at
19      0.6419  21      2.339       43      8.71    0.00363  34349_at
20*     0.6418  27      2.278       8       12.545  0.00052  35796_at

* indicates low expression value predicts CCR
Table 34. Comparison between several gene lists

Rank1  A1      Rank2  A2     Access #    Gene Description
1      0.7093  1      0.713  38652_at    hypothetical protein FLJ20154
2*     0.6931  4*     0.665  37674_at    aminolevulinate, delta-, synthase 1
3      0.6865  3      0.667  41478_at    Homo sapiens cDNA FLJ30991 fis, clone HLUNG1000041
4*     0.6776  50*    0.629  34433_at    docking protein 1, 62kD (downstream of tyrosine kinase 1)
5*     0.6771  18*    0.643  32336_at    aldolase A, fructose-bisphosphate
6*     0.6763  15*    0.645  34760_at    KIAA0022 gene product
7      0.6723  108    0.618  40027_at    hypothetical protein
8*     0.6685  7*     0.655  1126_s_at   Homo sapiens CD44 isoform RC (CD44) mRNA, complete cds
9      0.6666  151    0.613  599_at      H2.0 (Drosophila)-like homeo box 1
10*    0.666   49*    0.629  40817_at    nucleobindin 1
11*    0.6642  69*    0.624  1403_s_at   small inducible cytokine A5 (RANTES)
12     0.663   40     0.632  1452_at     LIM domain only 4
13     0.6627  34     0.634  39607_at    myotubularin related protein 8
14*    0.6623  110*   0.618  1062_g_at   interleukin 10 receptor, alpha
15     0.6615  238    0.604  35260_at    KIAA0867 protein
16*    0.6602  12*    0.651  36144_at    KIAA0080 protein
17*    0.6573  2*     0.69   39418_at    DKFZP564M182 protein
18     0.6562  268    0.603  39931_at    dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 3
19     0.6558  22     0.64   38440_s_at  hypothetical protein

Rank 1 and A1 are calculated based on the data with T-cell patients removed.
Rank 2 and A2 are calculated based on all 167 training data.

* indicates low expression value predicts CCR
Table 35. Comparison between several gene lists





Al 


F^2 


/C 




r 


QS815 


6GBBr 


0512 


3EB0B_ct s|cOdrgfeidcr, ayiii B^jtjirB<icti6 


2 


09031 


180 


0612 


33<£9_r_£t QC]TfjaTBtfecbTHp^^Bd3 


3 


09135 


719 


Q5B2 


3W^JL I^J^3^pe^7^KodIaBBQoaBdpnafian(1I^n^RSI^3erd 


4 


09071 


56 


05BB 


383Cet NMBaSpdan 


5 


09CFI 


3G2 


Q505 


33219ct odear receptor SLtferriy 3 9CipQnrBTtEr2 


6 


09008 


2720 


056 


3a2M_d fiarl<hBBdb9(D1 


7 


O9005 


860 


0579 


32159«t v444iEi£l^rstEnistsuiJUiu2urdu&JL|^BhcniJ^ 


8 


QSOB 




03» 


3321_sjt cyinEI 


9 


08S;'4 




Q5B2 


TBESSjJt rv(rttiBbcdpdBnRJ14S29 


10 


08878 


144 


0614 


4U27_^ NM1007pdein 


11 


08878 


5788 


0521 


34484£t bn^dnAii f 3 3ta jg ainerudectidbediaiy protein2 


12 


08878 


2& 


05B2 


34364_d pglici^3dyll9cnH3BeE(cydqpran 


13 


08878 


1908 


05B9 


408DB_^ aL-FaArHDR^APaMVB^II. ai>i30tnCNRfiCTCR 


14 


Q8B14 


842 


05re 


366G5_^ C£tt^alk^i(Gd]ggBn^f£lnBQ^ptGr, Uiuitxk|XJiiiinaaE|lQr) 


15 


08782 


7908 


Q5D5 


6Q8_^ 4.Lf|J4JLtartE 


16 


0875 


779 


Q5B1 




17 


oa?5 




0517 


37238 s £t rTBitiTBfRmlHl^iiTBr^ 


18 


0875 


4Q» 


QS6 


3G844ct HnDsqsenEv SrTla'toRl^^ld>A:^Ei00ajrtBf7g3ie; dC7BlM^/»w») nrf?sl^ potid afe 


igr 


Q8718 


Z 


089 


3M18£t CK7=S54Vn82pdai 



Rank 1 and A1 are calculated based on the T-cell data only.
Rank 2 and A2 are calculated based on all 167 training data.

The following tables represent consolidations of a number of different gene
lists representing rankings in B-Cell and T-Cell data sets.
Accession     Gene Description
38343_at      KIAA0328 protein
36656_at      CD36 antigen (collagen type I receptor, thrombospondin receptor)
33056_at      endonuclease G-like 2
41010_at      Homer, neuronal immediate early gene, 1B
38545_at      inhibin, beta B (activin AB beta polypeptide)
1496_at       protein tyrosine phosphatase, receptor type, A
40755_at      MHC class I polypeptide-related sequence A
400_at        insulin promoter factor 1, homeodomain transcription factor
40006_at      sialyltransferase 4B (beta-galactosidase alpha-2,3-sialytransferase)
35856_r_at    glutamate receptor, ionotropic, kainate 1
31627_f_at    amine oxidase, copper containing 3 (vascular adhesion protein 1)
38719_at      N-ethylmaleimide-sensitive factor
36573_at      DEAD/H (Asp-Glu-Ala-Asp/His) box binding protein 1
37152_at      peroxisome proliferative activated receptor, delta
41840_r_at    Homo sapiens clone IMAGE 25997 1
160030_at     growth hormone receptor
39198_s_at    CGI-87 protein
38741_at      pleckstrin homology, Sec7 and coiled/coil domains 2-like
39844_at      Homo sapiens, Similar to RIKEN cDNA 2600001B17 gene, clone IMAGE:2822298, mRNA, partial cds
36069_at      KIAA0456 protein
34465_at      retinoschisis (X-linked, juvenile) 1
34426_at      major histocompatibility complex, class I-like sequence
Gene Description


neuronal protein 
Rho guanine nucleotide exchange factor (GEF) 4


hepatocyte nuclear factor 3, beta 


ribosomal protein S2 
hypothetical protein from EUROIMAGE 2021883
Gene Description


inositol 1,4,5-triphosphate receptor, type 3 


lectin, galactoside-binding, soluble, 1 (galectin 1) 


intracellular hyaluronan-binding protein 


Homo sapiens cDNA FLJ30991 fis, clone 
HLUNG1000041 


Homo sapiens CD44 isoform RC (CD44) mRNA, 
complete cds 


ATPase, Ca++ transporting, type 2C, member 1


KIAA0867 protein 


hypothetical protein 


tumor protein p53-binding protein, 1


secreted protein, acidic, cysteine-rich (osteonectin) 


aminolevulinate, delta-, synthase 1 


MAX binding protein 


cyclin-dependent kinase 8 


gamma-aminobutyric acid (GABA) B receptor, 1


butyrophilin, subfamily 3, member A1


mel transforming oncogene (derived from cell line 
NK14)- RAB8 homolog 


interleukin 10 receptor, alpha 


hypothetical protein, expressed in osteoblast 


jagged 1 (Alagille syndrome) 


proteoglycan 1, secretory granule 


lymphocyte cytosolic protein 1 (L-plastin) 
Gene Description


KIAA0022 gene product 


zinc finger protein 207 


hypothetical protein FLJ20274 


aldolase A, fructose-bisphosphate 


hypothetical protein FLJ20154 


KIAA0080 protein 


cyclin C 


KIAA0397 gene product 


vav 1 oncogene 


mitogen-activated protein kinase-activated protein 
kinase 3 


Cdc42 effector protein 3 


glucosamine-6-phosphate isomerase 


hypothetical protein from EUROIMAGE 2021883 


Homo sapiens cDNA FLJ32819 fis, clone 
TESTI2002937, weakly similar to HISTONE H3.2 


dual-specificity tyrosine-(Y)-phosphorylation 
regulated kinase 3 


glycophorin C (Gerbich blood group) 


KIAA0232 gene product 


S100 calcium-binding protein A4 (calcium protein,
calvasculin, metastasin, murine placental homolog) 
EXAMPLE XI.

Correlated Gene Lists for Outcome Prediction in Pre-B ALL Cohort

Introduction. This Example summarizes and correlates selected gene
lists predictive of outcome (specifically, CCR vs. Failure) obtained for the
pre-B ALL cohort described in Example IB. "Task 2" refers to CCR vs. FAIL for
B-cell + T-cell patients; "Task 2a" is CCR vs. FAIL for B-cell only patients.
Gene lists selected for evaluation were produced by the following methods: (1)
a compilation of genes identified using feature selection combined with
supervised learning techniques such as SVM/RFE, Discriminant Analysis/t-test,
Fuzzy Inference/rank-ordering statistics, and Bayesian Nets/TNoM; note that
SVM/RFE and Bayesian Net/TNoM are both multivariate (MV) gene selection
techniques; the others are univariate; (2) TNoM gene selection; (3) supervised
classification; (4) empirical CDF/MaxDiff method; (5) threshold independent
approach; (6) GA/KNN; (7) uniformly significant genes via resampling; (8)
ANOVA "gene contrast" lists derived via Vxinsight.

The techniques fall into two broad categories, which we have termed
univariate and multivariate.

Group 1 (univariate). These methods evaluate the significance of a given gene
in contributing to outcome discrimination on an individual basis. They include:

o two-sample t-test (here equivalent to F-test or one-way ANOVA)
o Rank-ordering statistics
o ROC curves ("threshold-independent method 1")
o Common odds ratio approach ("threshold-independent method 2")
o "Most uniformly significant genes" via resampling: average rank from
172 train/test resamplings of the dataset, for each of 3 different methods:
F-test, ROC accuracy A, and TNoM score
o GA/KNN
o Empirical cumulative distribution function (CDF) MaxDiff approach
o TNoM method, used to pre-filter genes for use as parent sets in
constructing (and scoring) competing Bayesian nets that best explain the
training set data.
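
The resampling idea in the list above can be sketched as follows, purely as an illustration. The sketch scores each gene by TNoM, taken here to be the smallest number of misclassifications that any single expression threshold can achieve, and averages each gene's rank over repeated random training subsets. The default of 172 resamplings is taken from the text; the training fraction, the seed, the function names and the 1/0 outcome encoding are hypothetical.

```python
import random


def tnom(values, labels):
    """Threshold number of misclassifications for one gene (labels are 0/1)."""
    pairs = sorted(zip(values, labels))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = min(n_pos, n_neg)  # a cut below every sample
    pos_below = neg_below = 0
    i = 0
    while i < len(pairs):
        j = i
        # Advance over tied values so a cut never splits equal values.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                pos_below += 1
            else:
                neg_below += 1
            j += 1
        err_high_is_1 = pos_below + (n_neg - neg_below)
        err_high_is_0 = neg_below + (n_pos - pos_below)
        best = min(best, err_high_is_1, err_high_is_0)
        i = j
    return best


def average_ranks(expression, labels, score=tnom, n_resamples=172,
                  train_fraction=0.67, seed=0):
    """Average rank of each gene over random training subsets (lower is better)."""
    rng = random.Random(seed)
    probes = list(expression)
    n = len(labels)
    size = max(2, int(train_fraction * n))
    totals = {p: 0.0 for p in probes}
    for _ in range(n_resamples):
        train = rng.sample(range(n), size)
        sub_labels = [labels[i] for i in train]
        scores = {p: score([expression[p][i] for i in train], sub_labels)
                  for p in probes}
        # Lower score ranks first; tied genes keep their input order.
        for rank, p in enumerate(sorted(probes, key=lambda q: scores[q]), start=1):
            totals[p] += rank
    return {p: totals[p] / n_resamples for p in probes}
```

A gene whose average rank stays low across resamplings is "uniformly significant" in the sense used above. Any per-gene score for which lower is better can be passed as `score`; a score for which higher is better, such as an F statistic or ROC accuracy, would be negated first.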
Group 2 (multivariate). These methods identify groups of genes that act in
concert to discriminate outcome. The optimal gene groups are determined via
an iterative (SVM, stepwise DA) or combinatoric exploration (Bayesian)
procedure. They include:

• SVM/RFE (Support Vector Machines with Recursive Feature
Elimination)

• Bayesian net evaluation (via BD metric) of highest-scoring parent sets
(gene combinations)

• Stepwise discriminant analysis
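
As a minimal sketch of the first multivariate method listed above, recursive feature elimination repeatedly trains a linear classifier on the remaining genes and discards the gene with the smallest squared weight. The linear SVM below is trained by a simple deterministic subgradient descent on the hinge loss, without a bias term and assuming centered features; this trainer, its parameters and the function names are illustrative assumptions and do not represent the SVM implementation used in the work described here.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Weights of a linear SVM (no bias term) fit by subgradient descent.

    X: list of samples, each a list of centered feature values.
    y: labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Shrink for the L2 penalty, then step on the hinge term if violated.
            w = [wj * (1.0 - eta * lam) for wj in w]
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def rfe_ranking(X, y, **svm_args):
    """Feature indices ordered from most to least important by RFE."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y, **svm_args)
        # Drop the feature whose squared weight is smallest.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return remaining + eliminated[::-1]
```

Because the classifier is retrained after each elimination, a gene's weight reflects its contribution in the presence of the other surviving genes, which is what makes the selection multivariate. Removing one gene per iteration is the simplest choice; with thousands of probes, removing a fraction per iteration would be a common practical variant.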



The top genes in each group were identified to determine how often the same
genes turn up repeatedly within each group. The following two tables
correspond to Tasks 2 (Table 40) and 2a (Table 41). The top 20 genes found in
Table 40 are listed in Table 42 with more detailed annotations.
Description


DKFZP564M182 protein


FYN-binding protein 
(FYB-120/130) 


drebrin 1 


midkine (neurite growth- 
promoting factor 2) 


inositol 1,4,5-triphosphate 
receptor, type 3 


HNK-1 sulfotransferase 


lectin, galactoside-binding, 
soluble, 1 (galectin 1) 


Homo sapiens CD44 
isoform RC (CD44) 
mRNA, complete cds 


secreted protein, acidic, 
cysteine-rich (osteonectin) 


intracellular hyaluronan- 
binding protein 


glutathione-S-transferase 
like; glutathione transferase 
omega 


phytanoyl-CoA 
hydroxylase (Refsum 
disease) 


hypothetical protein 
FLJ20154 (aka 
hypothetical protein 
FLJ20367, NM 017787) 
(GO) 


Homo sapiens mRNA; 
cDNA DKFZp586C091
(from clone 
DKFZp586C091) 


Homo sapiens cDNA 
FLJ30991 fis, clone
HLUNG1000041 


glycophorin C (Gerbich 
blood group) (NM_002101 
analysis glycophorin C 
isoform 1; NM_016815
analysis glycophorin C 
isoform 2) 


hypothetical protein, 
expressed in osteoblast 


MAX binding protein 


cancer/testis antigen 




serine/threonine kinase 18 


small inducible cytokine 
A5 RANTES (chemokine 
(C-C motif) ligand 5)


wingless-type MMTV 
integration site family, 
member 4 


IMP inosine 
monophosphate 
dehydrogenase 2 


protein phosphatase 2 
regulatory subunit B B56 
gamma isoform 


upstream binding 
transcription factor RNA 
polymerase I 


erythropoietin receptor 
precursor 


pyruvate dehydrogenase 
kinase isoenzyme 1 


GRB2-related adaptor 
protein 2 


Nef-associated factor 1 


SMT3 (suppressor of mif 
two 3, yeast) homolog 2 


Cdc42 effector protein 3 


protein tyrosine kinase 9- 
like (A6-related protein) 


hepatocyte nuclear factor 3, 
beta 


syntaxin 1A (brain)


ATPase, H+ transporting, 
lysosomal (vacuolar proton 
pump), alpha polypeptide, 
70kD, isoform 1


type 1 tumor necrosis factor 
receptor shedding 
aminopeptidase regulator 
(NM_001750 analysis 
calpastatin) 


Nef-associated factor 1


KIAA0999 protein 
(hypothetical protein 
FLJ12240)


tropomyosin 4 


glucosamine-6-phosphate 
isomerase 


PIBF1 gene product
(progesterone-induced 
blocking factor 1) 


polymyositis/scleroderma 
autoantigen 1 (75kD) 


annexin A2 pseudogene 3 


zinc finger protein 134 
(clone pHZ-15) 


poly (ADP-ribose) 
glycohydrolase 


KIAA0769 gene product 


aldolase A, fructose- 
bisphosphate 


low density lipoprotein 
receptor-related protein 8, 
apolipoprotein e receptor 


chromosome 19 open 
reading frame 3 (regulator 
of G-protein signalling 19 
interacting protein 1) 


KIAA0263 gene product 


stem cell growth factor; 
lymphocyte secreted C-type 
lectin 


cullin 4B 


KIAA1007 protein 


protein tyrosine 
phosphatase, receptor type, 
K 


tumor protein p53-binding 
protein, 1


arachidonate 5- 
lipoxygenase 


tankyrase, TRF1-
interacting ankyrin-related
ADP-ribose polymerase


endothelial differentiation, 


collapsin response mediator 
protein 1 


sequestosome 1 


ribosomal protein S6 
kinase, 90kD, polypeptide 
3 


engulfment and cell
motility 1 (ced-12 
homolog, C. elegans) 


Cas-Br-M (murine) 
ecotropic retroviral
transforming sequence b 


origin recognition complex, 
subunit 5 (yeast homolog)- 
like 


proteoglycan 1, secretory 
granule 


calponin 3, acidic 


putative integral membrane 
transporter 


phosphatase and tensin 
homolog (mutated in 
multiple advanced cancers 


insulin-like growth factor 
binding protein 7 


phosphodiesterase 3B, 
cGMP-inhibited 


Krueppel-related zinc
finger protein


NAD(P) dependent steroid 

dehydrogenase-like; 

H105e3 


Homo sapiens mRNA; 
cDNA DKFZp586F2224 
(from clone 
DKFZp586F2224) 


KIAA0335 gene product 


tubulin, alpha 2 


recombination activating 
gene 2 


PHD finger protein 1


interleukin 1 receptor, type 


recombination activating 
gene 1


stress-induced- 
phosphoprotein 1 
(Hsp70/Hsp90-organizing 
protein) 


inositol 1,4,5-triphosphate 
receptor, type 1 


IL2-inducible T-cell kinase 


BarH-like homeobox 2 


tankyrase, TRF1-
interacting ankyrin-related
ADP-ribose polymerase


hemopoietic cell kinase 


HCGII-7 protein


mitogen inducible 2 


paternally expressed 10 


ESTs 


protease, serine, 7 
(enterokinase) 


KIAA0633 protein


leukocyte immunoglobulin-
like receptor, subfamily B
(with TM and ITIM
domains), member 2


profilin 2 


Homo sapiens mRNA; 
cDNA DKFZp586O1318
(from clone
DKFZp586O1318)


MAD (mothers against 
decapentaplegic, 
Drosophila) homolog 1 


chondroitin sulfate 
proteoglycan 2 (versican) 


ferrochelatase 
(protoporphyria) 


transcription factor-like 5 
(basic helix-loop-helix) 


H factor (complement)-like 


runt-related transcription 
factor 3 


immunoglobulin lambda- 
like polypeptide 1 


AD024 protein 
Description


small inducible cytokine A5 
(RANTES) 


Cdc42 effector protein 3 


Homo sapiens cDNA FLJ30991 
fis, clone HLUNG1000041


secreted protein, acidic, 
cysteine-rich (osteonectin) 


KIAA0867 protein 


B-cell associated protein 


AD024 protein 


interleukin 10 receptor, alpha 


syntaxin 1A (brain)


phytanoyl-CoA hydroxylase 
(Refsum disease) 


Homo sapiens CD44 isoform 
RC (CD44) mRNA, complete 
cds 


ribosomal protein, large, P0


hypothetical protein FLJ20274 


hypothetical protein FLJ20154 
(GO) 


potassium intermediate/small 
conductance calcium-activated 
channel, subfamily N, member 1 


hypothetical protein 


KIAA0022 gene product 


aminolevulinate, delta-, synthase 


BCL2-associated X protein 


azurocidin 1 (cationic 
antimicrobial protein 37) 


vesicle-associated membrane 
protein 2 (synaptobrevin 2) 


tumor suppressing 
subtransferable candidate 3 


Kelch-like ECH-associated 
protein 1 


KIAA0182 protein 


telomeric repeat binding factor 2 


erythropoietin receptor 


collapsin response mediator 
protein 1 


Sm protein F 


"origin recognition complex, 
subunit 5 (yeast homolog)-like" 


DKFZp566D133 protein 


"protein tyrosine phosphatase, 
receptor type, D" 


TRAF family member- 
associated NFKB activator 


"BTG family, member 3" 


phosphatase and tensin homolog 
(mutated in multiple advanced 
cancers 1)


KIAA0080 protein 


"dTDP-D-glucose 4,6- 
dehydratase" 


transcription factor-like 5 (basic 
helix-loop-helix) 


"proteoglycan 1, secretory 
granule" 


"ribosomal protein S4, Y- 
linked" 


hypothetical protein FLJ11191


paternally expressed 10 


Homo sapiens mRNA; cDNA 
DKFZp564B076 (from clone 
DKFZp564B076) 


"proteasome (prosome, 
macropain) 26S subunit, non- 
ATPase, 7 (Mov34 homolog)" 


modulator recognition factor 1


myosin X 


poly (ADP-ribose) 
glycohydrolase 


myotubularin related protein 8 


HCGII-7 protein 


Homo sapiens mRNA; cDNA 
DKFZp586F2224 (from clone 
DKFZp586F2224) 


cyclin C 


ELK4, ETS-domain protein 
(SRP accessory protein 1) 


LPS-induced TNF-alpha factor 


interleukin 4 receptor 


putative membrane protein 


SEC14(S. cerevisiae)-like 1 


nuclear factor of activated T-
cells 5, tonicity-responsive


NCK adaptor protein 1 


nucleotide-sugar transporter 
similar to C. elegans sqv-7 


hepatoma-derived growth factor
(high-mobility group protein 1-
like)


heat shock 70kD protein 9B 
(mortalin-2)


KIAA0008 gene product 


jagged 1 (Alagille syndrome) 


nucleobindin 1 


guanine nucleotide binding 
protein 11


H factor (complement)-like 3 


synaptic nuclei expressed gene 
1b


solute carrier family 31 (copper 
transporters), member 1 


protease, serine, 7 (enterokinase) 


zinc finger protein 145 
(Kruppel-like, expressed in 
promyelocytic leukemia) 


NO_SIF_seq


KIAA0716 gene product 


wingless-type MMTV 
integration site family, member 
5A 


type 1 tumor necrosis factor 
receptor shedding 
aminopeptidase regulator 


GRB2-related adaptor protein 2 


KIAA0808 gene product 


NO_SIF_seq


angiotensin receptor 1 


sorting nexin 4 


cyclin-dependent kinase 5 
SUMMARY




[Proteome 
FUNCTION:] FYN- 
binding protein; 
modulates 
interleukin 2 
production 


GENE NAME 


DKFZP564M182 
protein 


FYN binding 
protein (FYB- 
120/130) 


OMIM 




602731 


GENBANK 


AK025446, 
AK025446, 
AL049999, All 
Genbank 
[SUMMARY:] Cell
surface
carbohydrates
modulate a variety
of cellular functions
and are typically
synthesized in a
stepwise manner.
[SUMMARY:] This
gene encodes a
member of the
glutathione S-
transferase-like
family. In mouse,
stress response
CoA. It interacts
[SUMMARY:]
Glycophorin C
(GYPC) is an
integral membrane
glycoprotein. It is a
minor species
carried by human
erythrocytes, but
plays an important
role in regulating
the mechanical
stability of red cells.
glycophorin C
mutations have
been described.
glycophorin D,
result from single
point mutations of
[Proteome
FUNCTION:]
Cancer-testis
antigen
EXAMPLE XII

Gene Expression Profiling of Pediatric Acute Lymphoblastic Leukemia Reveals
Unique Subgroups Not Predicted by Current Genetic Risk Stratification

Summary

Current ALL classification schemes mask inherent biologic predictors of
outcome. Classification schemes that reflect the underlying biology of this disease
could guide patients to more tailored treatments. To develop gene expression-based
classification schemes related to the pathogenic basis of pediatric lymphoblastic
leukemia, gene expression patterns observed in the statistically designed cohort
containing 254 pediatric acute lymphoid leukemia (ALL) cases described in Example
IA were examined using Affymetrix U95AV2 oligonucleotide microarrays.
Additionally, in order to model remission vs. failure conditioned to predictive
cytogenetics, matched patients were selected among all major genetic prognostic
groups (MLL/AF4, BCR/ABL, E2A/PBX1, TEL/AML1, hyperdiploidy, and
hypodiploidy).

The data were analyzed for class discovery using unsupervised clustering
methods (hierarchical clustering and a force directed algorithm) and for class
prediction using supervised learning techniques including Bayesian Nets, Fisher's
Discriminant, and Support Vector Machines. During initial exploratory data analysis,
several distinct clusters were observed using unsupervised clustering methods.
Interestingly, no correlation between the currently employed risk classification groups
and these clusters was evident. In particular, ALL cases characterized by accepted
"good" and "poor" risk genetics were distributed differentially among the identified
clusters. This class discovery analysis indicates a more complex intrinsic genetic and
biologic background in pediatric ALL than currently appreciated.

Gene expression profiles associated with achievement of remission vs.
treatment failure were then sought using supervised learning techniques. Derived
predictive algorithms were applied to a training set of the data. Their performance
was evaluated with multiple cross validation and bootstrap runs, with an average
accuracy of 72% and low variance. These models are being tested on the validation
set. The results provide evidence of additional heterogeneity of pediatric ALL, which
may relate to novel transformation pathways and clinical outcomes.
Data Analysis

The analysis of the gene expression data was done in a two-step approach.
First, in order to identify potential clusters and inherent biologic groups, a large
number of clinical co-variables were correlated with the expression data using
unsupervised clustering methods such as hierarchical clustering, principal component
analysis and a force-directed clustering algorithm coupled with a novel visualization
tool (Vxinsight). Second, for class prediction, supervised learning methods such as
Bayesian Networks, Support Vector Machines with Recursive Feature Elimination
(SVM-RFE), Neuro-Fuzzy Logic and Discriminant Analysis were employed to create
classification algorithms. The performance of these classification algorithms was
evaluated using fold-dependent leave-one-out cross validation (LOOCV) techniques.
These methods combined allowed the identification of genes associated with
remission or treatment failure and with the different translocations across the dataset.
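
The leave-one-out evaluation mentioned above can be sketched as follows. A nearest-centroid classifier stands in for the Bayesian network, SVM, neuro-fuzzy and discriminant classifiers named in the text; that substitution, the function names and the toy labels are assumptions made only to keep the example self-contained. The model is refit inside every fold, so the held-out case never influences the classifier that predicts it.

```python
def nearest_centroid_fit(X, y):
    """Per-class mean expression vector."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Label of the centroid closest to x in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def loocv_accuracy(X, y, fit=nearest_centroid_fit,
                   predict=nearest_centroid_predict):
    """Fraction of cases predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)  # refit without the held-out case
        if predict(model, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

If gene selection is part of the classifier, it would be performed inside `fit` so that it is repeated in every fold; selecting genes once on the full data set and then cross-validating tends to overstate accuracy.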

Results

To explore potential clusters driven by gene expression profiles, the initial
analysis of the pediatric ALL cohort was accomplished using a force directed
clustering algorithm coupled with a novel visualization tool, Vxinsight, as described in
Example IB. Unexpectedly, we discovered 9 novel biologic clusters of ALL: 2
distinct T-cell ALL clusters (S1 and S2) and 7 distinct B-lineage ALL clusters (A, B,
C, X, Y, and Z, where cluster X contains 2 related clusters), each with distinguishing
gene expression profiles. Using ANOVA, we identified over 100 statistically significant
genes uniquely distinguishing each of these cohorts; a list of the top statistically
significant genes distinguishing each cluster is provided in Table 43. Review of these
lists of genes reveals many interesting signaling molecules and transcription factors.
The X cluster (which contains two highly related clusters) is quite unique in having
expression of several genes regulating methylation and folate metabolism.
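
For illustration, the per-gene statistic behind such an ANOVA screen can be written as a one-way F ratio of between-cluster to within-cluster variance. The sketch below is generic; how the cases were grouped for each contrast (for example, one cluster against all others) and how significance was then assessed are not specified here and are left to the caller.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one gene.

    groups: list of lists, the gene's expression values in each cluster.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    if ss_within == 0.0:
        # No within-group spread: F is unbounded if the means differ at all.
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

With exactly two groups this F value equals the square of the pooled-variance two-sample t statistic, so ranking genes by F for a two-class contrast gives the same order as ranking by the absolute t value.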

Examination of the cluster data reveals that while there are some trends, no
cytogenetic abnormality precisely defines or is correlated with any specific cluster. It
is interesting that cases with a t(12;21) or hyperdiploidy, both conferring low risk and
good outcomes, tend to cluster together, although combinations of these cases are
seen primarily in clusters C and Z as well as the top component of the X cluster,
indicating that there is still heterogeneity in gene expression profiles associated with
these clusters. On the terrain map from Vxinsight (Fig. 6, top) these three cluster
regions (C, Z, and X) lie fairly close to one another, indicating that they are
more related to each other than, for example, cluster C is to cluster S2. Although our
analysis of correlations between outcome and clusters is still underway, it is interesting that the hyperdiploid
and t(12;21) cases in cluster X had a significantly poorer outcome than those in
cluster C or Z, suggesting that these cluster groupings may reflect different biologic
propensities that confer differing responses to therapy. Similarly, the t(1;19) cases
clustered in Y had a poorer outcome than those in clusters A and B. Finally, it is of
interest that ALL cases with t(9;22) do not cluster; they appear to be distributed
among virtually all B precursor clusters. While we do not understand the significance
of this result, it suggests that the t(9;22) is a pre-leukemic or initiating genetic lesion
that may not be sufficient for leukemogenesis, or alternatively, that clones with a
t(9;22) are quite genetically unstable and transformation and genetic progression may
occur along many pathways. Results similar to our own were recently reported by
Fine et al. (Blood Abstract, Blood Supplement 2002 (753a, Abstract #2979)). Using
hierarchical clustering on a small series of 35 cell lines and ALL cases, these
investigators found a limited correlation between intrinsic biologic clusters in ALL
and cytogenetic abnormalities; cases with a t(9;22) were found to be particularly
heterogeneous in their gene expression profiles.

The stability and structure of the clusters were explored using methods of data
perturbation. Because the clusters appeared to be stable, subsequent exploration of
the group-characterizing genes was performed using analysis of variance (ANOVA).
This method was applied to order all of the genes with respect to differential
expression between the groups. The strongest 0.1% of the genes were tabulated in
lists. The strength of these gene lists was studied using statistical bootstrapping as
described in Example IB, and suggested that the identified groups represented
well-separated patient subclasses.

Surprisingly, with the exception of the T-ALL cases (clusters S1 and S2), the
clustering of ALL patients was independent of karyotype, suggesting that common
tumor genetics, as currently applied to prognostic schema, do not strongly influence
or drive innate expression profiling in pediatric ALL. However, fewer "adverse
prognosis" genetics were distributed among certain clusters (e.g. C and Z).
Remarkably, patients with translocations such as t(9;22)/BCR-ABL,
t(1;19)/E2A-PBX1, and t(12;21)/TEL-AML1 were distributed among several clusters,
suggesting biologic heterogeneity beyond the present tendency to group these various
entities for the purpose of prognosis and outcome prediction. The results of these class
discovery methods suggested that, when applied to our patient data set, unsupervised
techniques elucidate underlying novel subgroups of pediatric ALL. In turn, this
reassessment of tumor heterogeneity encourages the design of additional studies to
ascertain whether these data can enhance the discriminatory power of currently
employed prognostic variables.

Analysis was therefore next focused on class prediction. The process of
defining the best set of discriminating genes between known subsets of samples can
be accomplished using supervised learning techniques such as Bayesian Networks,
linear discriminant analysis and support vector machines (SVM). In contrast to
unsupervised methods that generate inherent "classes" for each gene or patient,
supervised learning methods are trained to recognize "known classes", creating
classification algorithms that may also uncover interesting and novel therapeutic
targets.

Genes that best discriminated T-lineage ALL from B-lineage ALL were
identified using principal component analysis and ANOVA of the cluster-differentiating
genes generated from the Vxinsight analysis. Significant overlap was
observed between the 2 methods used in our analysis of the T-cell ALL gene
expression profile, as well as with published data (Yeoh et al., Cancer Cell 1; 133-143,
2002), both in the actual presence of the same genes and in their relative rank
(Fig. 7). Importantly, this is evident across data sets and regardless of analytic
approach for T-cell ALL, suggesting that these genes define important features of
T-ALL biology. It also implies that T-ALL gene expression is inherently "less
complex" in delineating this leukemic entity than is the case for B-lineage ALL.

Gene expression profiles characteristic of translocation types were derived
using supervised learning techniques. Bayesian network analysis yielded 147 genes
that allowed the identification of samples within each of the major
translocation groups with accuracy rates higher than 90%, as calculated by fold-dependent
leave-one-out cross validation. This filtered data analysis of gene
expression conditioned on karyotype generated distinct case clustering, confirming
that unique gene expression "signatures" identify defined genetic subsets of ALL.
This corroborates recently published data (Yeoh et al., Cancer Cell 1; 133-143, 2002)
which revealed that karyotypic sub-groups of ALL are characterized by specific gene
expression profiles (Fig. 8). Unsupervised methods do not clearly identify clusters of
patients by therapeutic outcome. Nonetheless, some clusters (e.g. C, Y, S1) contain a
greater number of remission cases. When the clusters are examined for remission
versus failure by karyotype, it is evident that there is only minimal correlation
between the distribution of prognostically important tumor genetics and outcome. For
example, while clusters C and Z have similar distributions of case number and
karyotypic sub-types, more C group patients achieved remission. Cluster Y, which
harbors a greater proportion of adverse prognosis genetic types, unexpectedly
demonstrates a relatively high percentage of remission cases. These findings imply
that the biology of clinical outcome in pediatric ALL is more complex than previously
appreciated and is not readily determined by the relatively gross examination of tumor
cytogenetics. These data thus support the observation that relapse in pediatric ALL
occurs regardless of NCI clinical risk category or current genetic risk modifiers. It is
notable that gene expression analysis identifies 2 sub-populations of T-ALL, one of
which (S1) demonstrates a favorable therapeutic outcome.

Comparison with method and results of Yeoh et al. (Cancer Cell 1; 133-143, 2002)
Yeoh et al., in a study performed on the "Downing" or "St. Jude" data set as
described above, reported that pediatric ALL cases clustered according to the
recurrent cytogenetic abnormalities associated with ALL, and thus, that cytogenetics
could define these intrinsic groups. However, careful reading of this report and the
methods of analysis employed reveals that these investigators did not perform and/or
report the results of true unsupervised learning methods and class discovery. Rather,
these investigators first used supervised learning algorithms (primarily Support
Vector Machines) to identify short lists of expressed genes that were associated with
each recurrent cytogenetic abnormality in ALL. Using a highly selected set of only
271 genes that resulted from this supervised learning approach, they then performed
hierarchical clustering or PCA using the expression data derived from only this set of
selected genes. As would be expected from this approach, distinct ALL clusters could
be defined based on shared gene expression profiles and each cluster was associated
with a specific cytogenetic abnormality. However, this approach did not reveal what
the underlying structure of the gene expression profiles would be if one took a truly
unbiased approach and performed real class discovery.

Furthermore, although Yeoh et al. attempted to use supervised learning
methods to identify genes associated with outcome, they were not successful.
Potential outcome genes identified in training sets could not be confirmed in
independent test sets, indicating that the learning algorithms employed were
"over-fitting" the data, a not uncommon problem with supervised learning algorithms.
Another potential problem with these studies was that there was no statistical design for the
cases selected for study in this St. Jude cohort; cases were selected simply based on
sample availability. Thus, in contrast to our retrospective POG cohort design in
which cases with long term remission were balanced roughly 50:50 with cases that
failed, the St. Jude cases were predominantly cases with long term remission (>80%),
making the modeling in the St. Jude dataset far more difficult. We have come to
appreciate how important statistical design and case selection are to any array study
(indeed to any scientific study) and that for supervised learning algorithms and class
prediction, it is very important to have the label that one is trying to predict (such as
outcome or the presence of a particular genetic abnormality) balanced 50:50 in the
cohort undergoing modeling and within the training and test sets.

EXAMPLE XIII

Gene Expression Profiling for Molecular Classification and Outcome Prediction in
Infant Leukemia Reveals Novel Biologic Clusters, Etiologies and Pathways for
Treatment Failure

To determine if traditional biologic and clinical subgroups of infant leukemia
cases could be identified by gene expression profiles, 126 infant leukemia cases
registered to NCI-sponsored Infant Oncology Group/Children's Oncology Group
treatment trials were studied using oligonucleotide microarrays containing 12,625
probe sets (Affymetrix U95Av2 array platform). Of the 126 cases, 78 were ALL
(62%), 48 were AML (38%) and 53 (42%) cases had translocations involving the
MLL gene (chromosome segment 11q23).

The exploratory evaluation of our data set was performed in several steps. The
first step of the analysis was the construction of predictive classification algorithms
that linked the gene expression data to the traditional clinical variables that define
treatment, using supervised learning techniques, and further, the exploration of
patterns that could predict patient outcomes. As described in Example IA, the 126
patients were divided into statistically balanced and representative training (82
patients) and test sets (44 patients), according to the clinical labels (leukemia lineage,
cytogenetics and outcome). For classification purposes, two primary supervised
approaches were used: Bayesian networks and recursive feature elimination in the
context of Support Vector Machines (SVM-RFE). Additional classification
techniques (Fuzzy inference and Discriminant Analysis) were used for comparison
purposes.

All of the classification algorithms were established based on the training data
set and then used to predict the class of the samples in the test set. Two statistical
significance tests were employed to further evaluate the prediction accuracy of those
algorithms. The first tested whether the success rate of each classification algorithm
was significantly greater than the value that would be expected by chance alone (i.e.
whether the success rate was significantly greater than 0.5, where the success rate = #
of correct predictions / total predictions). The second prediction accuracy test used the
true positive proportion (TP) and false positive proportion (FP) values computed for
one of the two classes. For a binary classification problem, TP is the ratio of correctly
classified samples in the class to the total number in the class. FP is the proportion of
misclassified samples in the other class to the total number in that class. To test
whether the true positive proportion was significantly greater than the false positive
proportion, we used Fisher's exact test. The p-values of the two tests along with the
success rates for each of the classification algorithms with respect to the classification
tasks of interest are listed in Table 44. As shown in the table, both evaluation methods
confirmed that the classification results for the lineage labels (ALL/AML) and the
presence or absence of t(4;11) rearrangements were significant at level α=0.05. In
other words, all the supervised learning techniques employed were successful in
finding a distinction between ALL and AML samples, and between the presence and
absence of t(4;11) rearrangements. Detailed gene lists that characterize each one of these
leukemia subtypes were obtained from all the classifiers used and can be found in the
Supplemental Information.
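
By way of illustration only, the two significance tests described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, both tests are taken to be one-sided, and the first test is assumed to be an exact binomial test against a chance rate of 0.5.

```python
from math import comb

def binomial_p_greater(successes, trials, p0=0.5):
    """One-sided exact binomial p-value: P(X >= successes) for X ~ Binomial(trials, p0).

    Assumption: the 'success rate greater than 0.5' test is an exact binomial test.
    """
    return sum(comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
               for k in range(successes, trials + 1))

def fisher_exact_greater(tp, fn, fp, tn):
    """One-sided Fisher's exact test that the TP proportion exceeds the FP proportion.

    2x2 table: rows are the true class (in class / other class),
    columns are the prediction (in class / not in class).
    """
    row1, row2 = tp + fn, fp + tn      # sizes of the two true classes
    col1 = tp + fp                     # number of samples predicted "in class"
    denom = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    return sum(comb(row1, a) * comb(row2, col1 - a)
               for a in range(tp, upper + 1)) / denom

if __name__ == "__main__":
    # Hypothetical counts for a 44-sample test set (illustrative values only).
    print(binomial_p_greater(36, 44))
    print(fisher_exact_greater(tp=20, fn=4, fp=4, tn=16))
```

A classification task would be called significant at level α=0.05 in this sketch when both returned p-values fall below 0.05.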

Class discovery: Expression profiles partition infant leukemia cases into three groups

To explore the intrinsic structure of the data independent of known class
labels, several unsupervised clustering methods were employed. These unsupervised
approaches allowed patient separation into potential clusters based on overall
similarity in gene expression, without prior knowledge of clinical labels. As discussed
below, although a certain degree of correlation of our unsupervised clusters with
traditional lineage (ALL/AML) and cytogenetics (MLL or not) could be observed,
those labels were not enough to completely explain the results of our unsupervised
clustering methods, suggesting that leukemia lineage and cytogenetics are not the only
important factors in driving the inherent biology of these gene expression groups.

Initially, the data were investigated using agglomerative hierarchical
clustering (Eisen et al., 1998). Hierarchical clustering results from the 126 infant
leukemia samples using all genes yielded several groups that seemed to have no
relation to the known lineage labels or the partition of the data suggested by the
presence or absence of MLL rearrangements (see supplemental information).

The next technique used was Principal Component Analysis (PCA). PCA,
closely related to the Singular Value Decomposition (SVD), is an unsupervised data
analysis method whereby the most variance is captured in the fewest
coordinates (Jolliffe, 1986; Kirby, 2001; Trefethen & Bau, 1997). As shown in Fig. 9,
the first three principal components can be seen to partition the infant cohort into two
different groups. These groups capture the infant ALL/AML lineage distinction, but
only weakly agree with the MLL cytogenetics. Specifically, there is a 92% agreement
between the PCA and the ALL/AML labels and only a 65% agreement between the
PCA and MLL/non-MLL labels. Unexpectedly, the ALL/AML distinction does not
appear until the second principal component, suggesting that morphology is not the
most important factor explaining the variance in our data set. However, the first (and
most important) principal component does not reveal any obvious clusters. Upon
further analysis with a force-directed graph layout algorithm, we found an additional
group (discussed later) seen only in the first principal component (colored in blue in
Fig. 9).
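
The idea of capturing the most variance in the fewest coordinates may be illustrated with the following minimal sketch, which extracts only the first principal component. It is offered under stated assumptions: the function name is hypothetical, and power iteration is substituted here for a full SVD purely to keep the example free of external libraries; it is not the procedure used to produce Fig. 9.

```python
import random

def first_principal_component(rows, iterations=200, seed=0):
    """Return (unit direction, per-sample scores) for the first principal component.

    rows: list of samples, each a list of feature values (e.g. gene expression).
    Power iteration is an illustrative stand-in for an SVD-based PCA.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        # Apply X^T X to v without forming the d x d covariance matrix.
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

if __name__ == "__main__":
    # Toy data lying along the diagonal; the component should point along (1, 1).
    direction, scores = first_principal_component([[0, 0], [1, 1], [2, 2], [3, 3]])
    print(direction, scores)
```

Samples can then be split by the sign or magnitude of their scores along each component, which is the sense in which the components above "partition" the cohort.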

The force-directed clustering algorithm (Davidson et al., 1998; 2001) places
patients into clusters on the two-dimensional plane by balancing two opposing
forces. Briefly, the algorithm forms groups of patients by iteratively moving them
toward one another with small steps proportional to the similarity of their gene
expression, as measured by Pearson's correlation coefficient. To avoid collecting all
of the patients into a single group, a counteracting force pushes nearby patients away
from each other. This force increases in proportion to the number of nearby patients
and has a strong local effect, thus acting to disperse any concentrated group of
patients. This force affects only patients who are near each other, while the attractive
force (Pearson's similarity) is independent of distance. The algorithm moves patients
into a configuration that balances these two forces, thus grouping patients with similar
gene expression. The spatial distribution of patients is then visualized on a
three-dimensional plot, similar to a terrain map, where the height of the peaks denotes the
local density of patients. This method has been useful in inferring functions of
uncharacterized genes clustered near other genes with known functions (Kim, 2001)
and for the analysis and mapping of various databases (Davidson, 1998;
Werner-Washburne, 2002).
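
The two opposing forces described above may be sketched as follows. This is an illustrative toy, not the Vxinsight implementation: the function and parameter names (`step`, `radius`, `repulsion`, `init`) are hypothetical, negative correlations are assumed to exert no pull, and the repulsion is modeled as a fixed push from each neighbor within a cutoff radius so that it grows with the number of nearby patients.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def force_directed_layout(profiles, init=None, steps=200, step=0.05,
                          radius=0.5, repulsion=2.0, seed=0):
    """Place each profile on the plane; similar profiles drift together."""
    n = len(profiles)
    rng = random.Random(seed)
    pos = ([list(p) for p in init] if init else
           [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)])
    # Assumption: only positive correlations attract.
    sim = [[max(0.0, pearson(profiles[i], profiles[j])) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    for _ in range(steps):
        moves = []
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                vx, vy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
                dist = sqrt(vx * vx + vy * vy)
                if dist == 0.0:
                    continue
                pull = sim[i][j]          # attraction: independent of distance
                if dist < radius:
                    pull -= repulsion     # repulsion: local, one push per neighbor
                fx += pull * vx / dist
                fy += pull * vy / dist
            moves.append((step * fx, step * fy))
        for p, (mx, my) in zip(pos, moves):   # synchronous update
            p[0] += mx
            p[1] += my
    return pos

if __name__ == "__main__":
    profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
    start = [(-6, 0), (-4, 0), (4, 0), (6, 0)]
    print(force_directed_layout(profiles, init=start))
```

In the toy run, the two positively correlated pairs each settle at roughly the repulsion radius from one another while the two pairs remain far apart; a terrain map would then be obtained by estimating local point density over the resulting plane.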

When applied to the infant data, the Vxinsight clustering algorithm identifies
several patterns of gene expression across the patients, suggesting the existence of
three major groups (Fig. 10, and row three in Fig. 9), which hereafter will be denoted
clusters A, B, and C. Despite different means of data transformation and different
underlying mathematics, a high degree of overlap (92%) was observed between the
clusters derived from PCA and the B and C clusters identified through the clustering
algorithm native to Vxinsight®. In addition, when the A group is displayed in the
PCA projections (as seen in row three of Fig. 9), we see that it is distinguished from
the B and C clusters in the first principal component. This lends additional support to
the existence and importance of the A group.

Several further explorations into the Vxinsight clusters were pursued. Linear
discriminant analysis was used to separate the three clusters. The object of
discriminant analysis is to weight and linearly combine information from the feature
variables in a manner that clearly distinguishes labeled subclasses of the data. More
specifically, the idea is to find a linear function of the feature variables such that the
value of this function differs significantly between different classes. This function is
the so-called discriminant function. Then, ANOVA was performed to rank
cluster-discriminating genes in terms of their F-test statistic values. From the top genes, a
subset of genes was selected using stepwise discriminant analysis. This subset of
genes served as the discriminating variables needed by linear discriminant analysis.
The error rate of the derived classification results was 0.03, as estimated using
fold-independent leave-one-out cross-validation (LOOCV). This indicated that the three
Vxinsight clusters were well separated.
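
The ranking of genes by their F-test statistic may be illustrated with the following minimal sketch. The function names and the toy data are hypothetical, and the sketch covers only the one-way ANOVA ranking step, not the stepwise discriminant analysis or the LOOCV error estimate.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of values)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(expression, labels):
    """Order genes from most to least cluster-discriminating.

    expression: dict mapping gene name -> list of values, one per patient.
    labels: cluster label of each patient, in the same order.
    """
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
        scores[gene] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
    expression = {
        "gene_flat": [1, 2, 3, 1, 2, 3, 1, 2, 3],    # same distribution in every cluster
        "gene_shift": [1, 2, 3, 2, 3, 4, 3, 4, 5],   # mean shifts between clusters
    }
    print(rank_genes(expression, labels))
```

The top of the resulting list would then be passed to a feature-selection step before fitting the discriminant function.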

There was also support for the existence of the Vxinsight groupings even
when only a subset of the data was used. For example, three widely separated groups
of patients were observed when using only the patients in the training set. The
addition of the rest of the patients in the test set, however, did induce some change. In
particular, the cores of Group A and Group C remained separated while Group B
increased to include marginal members of Groups A and C. The observation of similar
grouping in both the entire set and the training set alone increased our interest in
discerning the force driving the clustering for the patients in the Vxinsight groups.

Finally, we confirmed our ability to classify patients into the Vxinsight groups
A, B, and C. Such a demonstration showed that we could categorize new patients into
our grouping in the future (e.g. for treatment or diagnosis). To accomplish this, a
multi-class Support Vector Machine (SVM) was trained using the actual labels A, B,
and C in the patients from the training set. The prediction accuracy of this SVM on
the test set was 95%. To verify that this result was improbable by chance alone, a
randomization test was also performed. The labels A, B and C were randomly
reassigned to the patients in both the training and the test set. Then, another SVM was
trained with the re-labeled data in the training set. This SVM achieved a prediction
accuracy of only 40% on the test set.
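
The randomization test described above may be sketched as follows. This is a minimal illustration under stated assumptions: a nearest-centroid classifier is substituted for the multi-class SVM to keep the example self-contained, the labels are reshuffled many times rather than once so that an empirical p-value can be reported, and all function names are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid 'training': one mean vector per class (stand-in for an SVM)."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def accuracy(train_X, train_y, test_X, test_y):
    model = fit_centroids(train_X, train_y)
    hits = sum(predict(model, x) == t for x, t in zip(test_X, test_y))
    return hits / len(test_y)

def randomization_test(train_X, train_y, test_X, test_y, rounds=200, seed=0):
    """Return (observed accuracy, empirical p-value under random relabeling)."""
    rng = random.Random(seed)
    observed = accuracy(train_X, train_y, test_X, test_y)
    labels = list(train_y) + list(test_y)
    at_least_as_good = 0
    for _ in range(rounds):
        rng.shuffle(labels)   # reassign labels across both training and test sets
        shuffled_train = labels[:len(train_y)]
        shuffled_test = labels[len(train_y):]
        if accuracy(train_X, shuffled_train, test_X, shuffled_test) >= observed:
            at_least_as_good += 1
    return observed, (at_least_as_good + 1) / (rounds + 1)

if __name__ == "__main__":
    train_X = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
    train_y = ["A", "A", "B", "B", "C", "C"]
    test_X = [[1, 0], [11, 10], [21, 0]]
    test_y = ["A", "B", "C"]
    print(randomization_test(train_X, train_y, test_X, test_y))
```

A large gap between the observed accuracy and the accuracies obtained with shuffled labels indicates that the classifier has learned structure that is actually present in the cluster labels.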






Subsequent exploration of the cluster-characterizing genes was performed
using analysis of variance (ANOVA). The F-scores from this method were used to
order all of the genes with respect to differential expression between the groups. The
100 highest-ranking genes were then tabulated. The stability and strength of these
gene lists were studied using statistical bootstrapping (Efron, 1979; Hjorth, 1994). This
analysis provided a powerful method for determining the likelihood that a gene (high
on the gene list determined from the actual data) would remain near the top of any
gene list generated from experimental data similar to that which we actually observed.
While this method allowed the identification of genes that had a unique pattern in
each cluster and defined inter-cluster differences, it is important to make a distinction
between these genes and the ones active in each one of the clusters (See supplemental
information). Some very surprising findings were uncovered after completing a
detailed analysis of the genes responsible for the distinction between clusters. These
results, together with the stability of the clusters, suggest that the identified groups
represent well-separated patient subclasses.
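
The bootstrap assessment of gene-list stability may be sketched as follows. The sketch is illustrative only and rests on stated assumptions: patients are resampled with replacement within each cluster, genes are re-ranked by a one-way ANOVA F statistic on each resample, and the function names, the `top_k` parameter and the toy data are hypothetical.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic; a zero within-group variance is treated as a
    perfect separation (infinite F) when the group means differ."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(map(sum, groups)) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if within == 0.0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))

def bootstrap_top_frequency(expression, labels, top_k=1, rounds=200, seed=0):
    """Fraction of bootstrap resamples in which each gene ranks within the top_k."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    counts = {gene: 0 for gene in expression}
    for _ in range(rounds):
        # Resample patients with replacement within each cluster (an assumption).
        sample = {c: [rng.choice(idxs) for _ in idxs] for c, idxs in by_class.items()}
        scores = {gene: f_statistic([[vals[i] for i in sample[c]] for c in sample])
                  for gene, vals in expression.items()}
        for gene in sorted(scores, key=scores.get, reverse=True)[:top_k]:
            counts[gene] += 1
    return {gene: c / rounds for gene, c in counts.items()}

if __name__ == "__main__":
    labels = ["A", "A", "A", "B", "B", "B"]
    expression = {
        "gene_strong": [0, 1, 2, 10, 11, 12],   # clearly separates the clusters
        "gene_flat": [5, 6, 7, 5, 6, 7],        # same values in both clusters
    }
    print(bootstrap_top_frequency(expression, labels))
```

A gene whose frequency stays close to 1 across resamples is one that would be expected to remain near the top of a list generated from similar data.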

Approaches to inherent biology 

Expression profiles identified different clusters of infant leukemia cases, not
related to type labels or cytogenetics, but characterized by different genes
predominantly expressed in, and probably related to, three independent disease
initiation mechanisms. The sets of cluster-discriminating genes can be used to identify
each biologic group and hence represent potentially important diagnostic and
therapeutic targets (See Table 45). A heat map/dendrogram was produced with the top
30 genes that characterized each one of the three clusters, generated from the
ANOVA analysis. Analysis of these genes revealed patterns that imply different
features with potential clinical relevance.

The top cluster of cases (Fig. 10, cluster A, n=20, 15 ALL cases and 5 AML
cases) has a gene expression profile that would not be recognized as "leukemic" per
se. The cases in this cluster are distinguished by high expression of genes such as the
novel tumor suppressor gene (ST5), embryonal antigens, adhesion molecules
(particularly integrin α3), growth factor receptors for numerous lineages
(keratinocytes and epithelial cells, hepatocytes, neuronal cells, and hematopoietic
cells) and genes in the TGFB1 signaling pathway. The TGFB cytokines modulate the
growth and functions of a wide variety of mammalian cell types. TGFB inhibits the
proliferation of most types of cells. Proteins such as the latent transforming growth
factor beta binding protein 4 (LTBP4), which is overexpressed in this group of
patients, are also regulated by TGFB (Oklu, 2000). For this particular group of
patients, cluster-discriminant genes such as CD34 (hematopoietic progenitor cell
antigen), ataxin 2 related protein (responsible for specific stages of both cerebellar
and vertebral column development), contactin 2 (involved in glial development and
tumorigenesis), the ski oncogene (another component of the TGFB1 signaling
pathway) and the erythropoietin receptor, suggest the involvement of an embryonal
"common progenitor" primordial cell. Additionally, despite high expression of the
above-mentioned characteristic genes, cases in this cluster demonstrated low to
moderate expression of most genes. These data support recent reports that a stepwise
decrease in transcriptional accessibility for multilineage-affiliated genes may
represent progressive restriction of developmental potential in early hematopoiesis
(Akashi et al., Blood 2003 Jan 15;101(2):383-9). As suggested by Akashi et al., the
size of the "functional genome" may be progressively reduced as hematopoietic stem
cells undergo differentiation.

Other genes in this group with an absolutely unique pattern of expression
include growth inhibitory factors like metallothionein 3 (MT3), embryonic cell
transcription factors (UTF1) and stem cell antigens (prostate stem cell antigen) with
remarkable homology to cell surface proteins that characterize the earliest phases of
hematopoietic development (Reiter, 1998).

The left cluster of cases (Fig. 10, cluster B, n=52, 51 ALL cases and 1 AML
case) is characterized by a high frequency of MLL rearrangements, predominantly
t(4;11). This group was also distinguished by expression of lymphoid-characterizing
genes (CD19, B lymphoid tyrosine kinase, CD79a) as well as EBV infection-related
genes and genes associated with, or induced by, other DNA viruses. It is especially
remarkable to find elevated expression of the Epstein-Barr virus-induced gene 2
(EBI2) in more than 30% of the cases in this cluster (~82% of these cases have MLL
rearrangements). EBI2 has been reported as one of the genes present in EBV infected
B-lymphocytes (Birkenbach, 1993). Epstein-Barr virus infection of B lymphocytes, as
well as infection of Burkitt lymphoma cells, induces an increase in the expression of
this gene, identifiable by subtractive hybridization. We speculate that this group of
cases might be initiated by a viral infection and that secondary, but critical, MLL
translocations stabilize or, alternatively, more fully transform these cells.

Finally, the third, rightmost cluster (Fig. 9, cluster C, n=54, 42 AML cases and
12 ALL cases) is more heterogeneous and has a broader spectrum of MLL
translocations. The gene expression signature of this group seems to have "myeloid"
characteristics, with activation of genes previously reported as "myeloid-specific"
such as Cystatin C (CST3), the myeloid cell nuclear differentiation factor (MNDA),
and CCAAT/enhancer binding protein delta (C/EBP) (Golub, 1999; Skalnik, 2002).
Members of the CCAAT/enhancer binding protein (C/EBP) family of transcription
factors are important regulators of myeloid cell development (Skalnik, 2002). Other
genes useful for cluster C prediction may also provide new insights into infant
leukemia pathogenesis. For example, the mitogen activated protein kinase-activated
protein kinase 3 is the first kinase to be activated through all 3 MAPK cascades:
extracellular signal-regulated kinase (ERK), MAPKAP kinase-2, and Jun-N-terminal
kinases/stress-activated protein kinases (Ludwig, 1996). It has been shown to be a
key integrative element of signaling in both mitogen and stress responses.
MAPKAPK3 showed high relative expression in the patients in cluster C. Many of the
genes that characterize this cluster encode proteins characteristic of definitive myeloid
differentiation (NDUFAB1, SOD1, GSTTLp28), or which are critical for signal
transduction (TYROBP). Interestingly, activation of many DNA repair and GST
genes was also evident in this group of cases.

Altogether, the results of our class discovery methods suggested that, when
applied to our patient data set, unsupervised techniques elucidate underlying novel
subgroups of infant leukemia cases. In turn, this reassessment of tumor heterogeneity
encourages the design of additional studies to ascertain whether these data can
enhance the discriminatory power of currently employed prognostic variables.

Heterogeneous distribution of the MLL cases 

The most common mutations in infant leukemia are translocations of the MLL
gene at chromosome band 11q23. Interestingly, the MLL cases in cluster A (Fig. 10,
lower left panel) are primarily t(4;11) (n=7), together with two cases with t(10;11) and
one with t(11;19). Cluster B, composed almost entirely of ALL cases, contains a
large number of t(4;11) cases (n=29) as well as four cases with t(11;19), one case of
t(10;11), and one case of t(1;11). Finally, the bottom right cluster (n=54),
predominantly AML but containing twelve cases with an ALL label that nonetheless
have more "myeloid" patterns of gene expression, also comprises five cases with
t(9;11), three cases with t(1;11), three cases with t(11;19), one case with t(4;11) and
three cases with other MLL translocations.
MLL cases with the same translocation (t(4;11) in clusters A and B) had
dramatic differences in their gene expression profiles. The mechanisms that might
underlie this striking difference are currently under study. Genes that have common
patterns in the MLL cases across all three clusters have been identified, as well as
genes that are uniquely expressed and which distinguish each MLL translocation
variant. Although MLL cases are not homogeneous, it is interesting that the list of
statistically significant genes derived in this study is quite similar to the list of genes
derived by previous groups working in infant MLL leukemia (Armstrong, 2002). For
reasons not understood, infants are more prone to MLL rearrangements that inhibit
apoptosis and cause transformation (reviewed in Van Limbergen et al., 2002). Our
results suggest that the MLL translocation in these patients may not be the "initiating"
event in leukemogenesis. It is possible that after a distinct initiating event, the infant
patient is more prone to rearrange the MLL gene, and that this rearrangement leads to
further cell transformation by preventing apoptosis. Alternatively, an MLL
translocation could be a permissive initiating event, with leukemogenesis and the final
gene expression profile determined more strongly by second mutations. Further
studies within the MLL group of infant leukemia patients may provide clues to the
processes that determine leukemic transformation.

Pathways to failure in infant leukemia 

In general, gene expression data has supported the existence of several
categories of acute leukemias related to the traditionally defined leukemia types, ALL
and AML (Golub, 1999; Moos, 2002). However, while expression profiling is a
robust approach for the accurate identification of known lineage and molecular
subtypes across acute leukemia cases, the search for clinically relevant prognosis
discriminators based on gene expression patterns has been less successful (Armstrong,
2002; Ferrando, 2002; Yeoh, 2002). As shown in Table 46, only SVM-RFE was able
to identify remission vs. failure across the unconditioned data set with a total error
rate differing from random prediction (success rate of 64% at a significance level of p
< 0.1). Interestingly, the performance of our outcome classification algorithms was
not increased when conditioned on either of the traditional criteria of lineage (ALL
vs. AML) or cytogenetics (MLL vs. not MLL), providing further support for
questioning the predictive value of these traditional clinical labels in explaining
outcome in infant patients. However, far greater success in outcome prediction is
obtained when conditioning the classifying algorithms on the Vxinsight cluster
membership. The effect of the three Vxinsight clusters on our ability to predict
remission vs. failure was then explored. In particular, we attempted to predict
remission vs. failure in the entire data set, conditioned on the knowledge of which
Vxinsight cluster each case falls into. The hope was that, by utilizing knowledge of
Vxinsight cluster membership, inter-cluster expression profile variability of cases,
which is not necessarily relevant to outcome prediction, would be eliminated,
allowing intra-cluster variability relevant to outcome prediction to be more easily
discovered by our classification algorithms.
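The conditioning idea described above can be illustrated with a minimal sketch. It is not the Bayesian network or SVM implementation used in this work: a simple nearest-centroid classifier stands in for the supervised learner, and all function names and the toy data are hypothetical. The sketch only shows how one classifier is trained per cluster and how each case is routed to the classifier of its own cluster.

```python
from collections import defaultdict
from statistics import mean


def fit_nearest_centroid(X, y):
    """Return {outcome label: mean expression vector} for the training cases."""
    rows_by_label = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_label[label].append(row)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in rows_by_label.items()}


def predict_nearest_centroid(model, row):
    """Predict the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, row))
    return min(model, key=lambda label: sq_dist(model[label]))


def fit_conditioned(X, y, clusters):
    """Fit one classifier per cluster, i.e. condition on cluster membership."""
    grouped = defaultdict(lambda: ([], []))
    for row, label, cluster in zip(X, y, clusters):
        grouped[cluster][0].append(row)
        grouped[cluster][1].append(label)
    return {cluster: fit_nearest_centroid(rows, labels)
            for cluster, (rows, labels) in grouped.items()}


def predict_conditioned(models, row, cluster):
    """Route a case to the classifier trained on its own cluster."""
    return predict_nearest_centroid(models[cluster], row)


# Hypothetical toy data: the same gene points in opposite directions
# in clusters A and B, so a single pooled classifier cannot separate outcomes.
X = [[2.0], [0.0], [0.0], [2.0]]
y = ["CR", "fail", "CR", "fail"]
clusters = ["A", "A", "B", "B"]
models = fit_conditioned(X, y, clusters)
print(predict_conditioned(models, [1.9], "A"))
print(predict_conditioned(models, [1.9], "B"))
```

In this toy setting the pooled centroids of the two outcome classes coincide, whereas the per-cluster classifiers separate them. This is the kind of intra-cluster signal the conditioning step is intended to expose.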

Table 46 demonstrates that prediction accuracy is gained by coupling the
supervised learning algorithms with Vxinsight clustering. In the Bayesian method,
accuracy against the test set rises from 0.568 (p=0.256) to 0.703 (p=0.010). Smaller
improvements after conditioning are found with the other methods as well. One can
also look at the prediction accuracy within the Vxinsight clusters individually. There
again, a general rise in accuracy is observed, though not to a level of statistical
significance, possibly due to the small size and/or class balance of the individual
clusters.
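A minimal sketch follows of how a test-set accuracy could be compared with random prediction. The text does not state which test produced the p-values or how large the test set was. A one-sided exact binomial test against a chance rate of 0.5 is assumed here, and the counts 21/37 and 26/37 are an inference from the rounded accuracies 0.568 and 0.703, not figures given in the text.

```python
from math import comb


def binomial_upper_tail(successes, trials, p0=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p0): a one-sided exact test."""
    return sum(comb(trials, k) * p0 ** k * (1 - p0) ** (trials - k)
               for k in range(successes, trials + 1))


# Assumed counts (test-set size is not stated in the text).
n_test = 37
for correct in (21, 26):
    p_value = binomial_upper_tail(correct, n_test)
    print(f"{correct}/{n_test} correct: accuracy={correct / n_test:.3f}, p={p_value:.3f}")
```

Under these assumed counts the tail probabilities come out close to the p-values quoted above. This is consistent with, but does not establish, a binomial comparison against chance.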

We note that, from the more abstract perspective of machine learning theory,
the construction of the Vxinsight clusters is viewed as an external feature creation
algorithm that is applied to a data set before the supervised learning algorithms begin
their training. In the application at hand, the created feature is 3-valued, indicating
membership of a case in Vxinsight cluster A, B, or C. This feature creation process is
akin to the pre-selection of features, based on measures of information content, that is
employed by many supervised learning algorithms when run on problems of high
dimensionality. One difference between the Vxinsight feature creation step and
traditional feature selection is that Vxinsight clustering is performed without
knowledge of the class label to be predicted (outcome, in this context), and hence it is
reasonable to perform the clustering on the entire data set (train and test sets
combined) at once.
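Viewed as feature creation, the cluster label can simply be appended to each case before supervised training. The sketch below is an illustration under that reading. The function name, the one-hot encoding, and the example values are assumptions rather than part of the described method.

```python
def add_cluster_feature(X, clusters, levels=("A", "B", "C")):
    """Append a one-hot encoding of the 3-valued cluster label to each feature row."""
    augmented = []
    for row, cluster in zip(X, clusters):
        if cluster not in levels:
            raise ValueError(f"unknown cluster label: {cluster!r}")
        one_hot = [1.0 if cluster == level else 0.0 for level in levels]
        augmented.append(list(row) + one_hot)
    return augmented


expression = [[7.1, 3.2], [6.8, 2.9], [1.5, 9.4]]   # hypothetical signal values
membership = ["A", "C", "B"]                        # labels from the unsupervised step
features = add_cluster_feature(expression, membership)
```

Because the cluster labels are computed without reference to outcome, creating this feature on the combined train and test sets does not leak the class label into training.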
The relative strength of the gene lists and parent sets can be thought of as
being correlated with the prediction accuracy within the corresponding Vxinsight
cluster. However, it is the application of the lists and parent sets together within the
two-step Vxinsight / supervised learning conditioning framework described above
that achieves statistical significance in its accuracy.

It is rather unlikely that random chance alone would improve such accuracy
levels, since a process independent of the best error rate generated the Vxinsight
clustering. These results are taken as strong evidence that the Vxinsight patient
clusters reflect biologically important groups and are clinically exploitable. In
contrast, comparable accuracy was not achieved by conditioning on either of the
traditional criteria of ALL vs. AML or MLL vs. not MLL. This may indicate that, as
determined by our molecular analysis, these traditional clinical criteria for segregating
treatment cohorts are less defining than has been supposed.

Table 47 lists the resulting set of distinguishing genes associated with
remission/failure in the overall data set (not partitioned by type, cytogenetics or
cluster), which represent potentially important diagnostic and therapeutic targets.
These outcome-correlated genes include Smurf1, a new member of the family
of E3 ubiquitin ligases. Smurf1 selectively interacts with receptor-regulated MADs
(mothers against decapentaplegic-related proteins) specific for the BMP pathway in
order to trigger their ubiquitination and degradation, and hence their inactivation.
Targeted ubiquitination of SMADs may serve to control both embryonic development
and a wide variety of cellular responses to TGF-β signals (Zhu, 1999). Another
interesting gene is the SMA- and MAD-related protein, SMAD5, which plays a
critical role in the signaling pathway in the TGF-β inhibition of proliferation of
human hematopoietic progenitor cells (Bruno, 1998). The list also included regulators
of differentiation and development: bone morphogenetic protein 2, a member of the
transforming growth factor-beta (TGF-β) superfamily and a determinant in neural
development (White, 2001); DYRK1, a dual-specificity protein kinase involved in
brain development (Becker, 1998); a small inducible cytokine A5 (SCYA5); the T cell
activation increased late expression (TACTILE); and a myeloid cell nuclear
differentiation antigen (MNDA). It is remarkable that this list includes potential
diagnostic or therapeutic targets like the ERG oncogene (V-ETS Avian
Erythroblastosis virus E26 oncogene related, found in AML patients), the
phospholipase C-like protein 1 (PLCL, tumor suppressor gene), a cysteine-rich
angiogenic inducer (CYR61), and the MYC and MYB oncogenes. Other genes in the list
are located in critical regions mutated in leukemia, which suggests their connection
with the leukemogenic process. Such genes include Selenoprotein P (SPP1, 5q), the
protein kinase inhibitor p58 (DNAJC3 in 13q32), and the cyclin C (CCNC).

Discussion 

Traditionally, infant leukemia has been classified according to a host of
clinical parameters and biological features that tend to correlate with prognosis. This
classification system has been used for risk-based classification assignment. However,
unexplained variability in clinical courses still exists among some individuals within
defined risk-group strata. Differences in the molecular constitution of malignant cells
within subgroups may help to explain this variability.

In our initial profiling of 126 infant acute leukemia cases, we have used
microarray technology both to segregate patient subgroups and to uncover genetic
diversity among patients that fall within the same traditional risk groups. The results
reported here identify three previously unrecognized groups of infant leukemia cases,
driven by differential gene expression patterns and possibly related to three
independent disease initiation mechanisms. Two of these clusters support previous
data about leukemic etiology: environmental exposure and viral infections, both of
which may occur in utero.

Our data also supports the existence of a third group, with a particular gene
expression pattern suggestive of a novel stem cell neoplasia with leukemic behavior.
The genes expressed in most of these cases resemble those present in the
hematopoietic/angioblastic primordial cell (Young, 1995; Eichman, 1997); see, for
example, Figs. 11 and 12. This subgroup may be therapeutically relevant and may
also provide additional evidence for the existence of a common progenitor, possibly
the primordial hematopoietic/endothelial cell. The gene expression blueprint of this
cluster seems to characterize a unique and distinct subclass of infant leukemia that
represents transformed, true multi-potent stem cells or "cancer stem cells". There is an
important body of work suggesting that normal hematopoietic stem cells may be the
target of transforming mutations and that cancer cell proliferation is driven by cancer
stem cells (Reya, 2001). Our data provides further evidence in support of the
hypothesis that newly arising cancer cells may appropriate the machinery for
self-renewing cell divisions, which is normally expressed in stem cells.

Together, these results indicate the occurrence of at least three inherent
biological subgroups of infant leukemia, not precisely defined by traditional AML vs.
ALL or cytogenetics labels and probably driven by characteristics with potential clinical
relevance. Consideration of these three categories may enable selection criteria for
more powerful clinical trials, and might lead to improved treatments with better
success rates.

METHODS 

To develop gene expression-based classification schemes related to the
pathogenic basis underlying the leukemic process in infant acute leukemia, 126
patients registered to NCI-sponsored Infant Oncology Group/Children's Oncology
Group treatment trials were examined using Affymetrix U95Av2 oligonucleotide
microarrays containing 12,625 probes. Of the 126 cases, 78 were ALL (62%), 48
were AML (38%) and 56 (44%) cases had translocations involving the MLL gene
(chromosome segment 11q23). An average of 2 x 10^7 cells were used for total RNA
extraction with the Qiagen RNeasy mini kit (Valencia, CA). The yield and integrity of
the purified total RNA were assessed with the RiboGreen assay (Molecular Probes,
Eugene, OR) and the RNA 6000 Nano Chip (Agilent Technologies, Palo Alto, CA),
respectively. Complementary RNA (cRNA) target was prepared from 2.5 µg total
RNA using two rounds of Reverse Transcription (RT) and In Vitro Transcription
(IVT). Following denaturation for 5 minutes at 70°C, the total RNA was mixed with
100 pmol T7-(dT)24 oligonucleotide primer (Genset Oligos, La Jolla, CA) and
allowed to anneal at 42°C. The mRNA was reverse transcribed with 200 units
Superscript II (Invitrogen, Grand Island, NY) for 1 hour at 42°C. After RT, 0.2 vol.
5X second strand buffer, additional dNTP, 40 units DNA polymerase I, 10 units DNA
ligase, and 2 units RNase H (Invitrogen) were added and second strand cDNA synthesis
was performed for 2 hours at 16°C. After addition of T4 DNA polymerase (10 units),
the mix was incubated an additional 10 minutes at 16°C. An equal volume of
phenol:chloroform:isoamyl alcohol (25:24:1) (Sigma, St. Louis, MO) was used for
enzyme removal. The aqueous phase was transferred to a microconcentrator
(Microcon 50, Millipore, Bedford, MA) and washed/concentrated with 0.5 ml DEPC
water twice; the sample was concentrated to 10-20 µl. The cDNA was then transcribed
with T7 RNA polymerase (Megascript, Ambion, Austin, TX) for 4 hours at 37°C.
Following IVT, the sample was phenol:chloroform:isoamyl alcohol extracted, washed
and concentrated to 10-20 µl. The first round product was used for a second round of
amplification, which utilized random hexamer and T7-(dT)24 oligonucleotide
primers, Superscript II, two RNase H additions, DNA polymerase I plus T4 DNA
polymerase and, finally, a biotin-labeling high yield T7 RNA polymerase kit (Enzo
Diagnostics, Farmingdale, NY). The biotin-labeled cRNA was purified on Qiagen
RNeasy mini kit columns, eluted with 50 µl of 45°C RNase-free water and quantified
using the RiboGreen assay. Following quality check on Agilent Nano 900 Chips,
15 µg cRNA were fragmented following the Affymetrix protocol (Affymetrix, Santa
Clara, CA). The fragmented RNA was then hybridized for 20 hours at 45°C to
HG_U95Av2 probes. The hybridized probe arrays were washed and stained with the
EukGE_WS2 fluidics protocol (Affymetrix), including streptavidin phycoerythrin
conjugate (SAPE, Molecular Probes, Eugene, OR) and an antibody amplification step
(Anti-streptavidin, biotinylated, Vector Labs, Burlingame, CA). HG_U95Av2 chips
were scanned at 488 nm, as recommended by Affymetrix. The expression value of
each gene was calculated using Affymetrix Microarray Suite 5.0 software.

Data Presentation and Exclusion Criteria 

The criteria used as quality controls included total RNA integrity,
cRNA quality, array image inspection, B2 oligo performance, and internal control
genes (GAPDH value greater than 1800).

Data Analysis 

Affymetrix MAS 5.0 statistical analysis software was used to process the raw
microarray image data for a given sample into quantitative signal values and
associated present, absent or marginal calls for each probeset. A filter was then
applied which excluded from further analysis all Affymetrix "control" genes
(probesets labeled with AFFY_ prefix), as well as any probeset that did not have a
"present" call in at least one of the samples. For this analysis our Bayesian
classification and Vxinsight clustering analysis omitted this step, choosing instead to
assume minimal a priori gene selection (Helman et al., 2003; Davidson et al., 2001).
The filtering step reduced the number of probe sets from 12,625 to 8,414, resulting in
a matrix of 8,414 x N signal values, where N is the number of cases. The first stage of
our analysis consisted of a series of binary classification problems defined on the
basis of clinical and biologic labels. The nominal class distinctions were ALL/AML,
MLL/not-MLL, and achieved complete remission (CR)/not-CR. Additionally, several
derived classification problems, based on restrictions of the full cohort to particular
subsets of data such as a Vxinsight cluster, were considered (see main text). The
multivariate supervised learning techniques used included Bayesian nets (Helman
et al., 2003) and support vector machines (Guyon et al., 2002). The performance of
the derived classification algorithms was evaluated using fold-dependent leave-one-out
cross validation (LOOCV) techniques. These methods combined allowed the
identification of genes associated with remission or treatment failure and with the
presence or absence of translocations of the MLL gene across the dataset.
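The probeset filter and the leave-one-out evaluation described above can be sketched as follows. This is an illustration only. The data layout, the function names, and the nearest-centroid stand-in classifier are assumptions, and "fold-dependent" is read here as meaning that everything learned by the classifier is derived from the training cases of each fold alone.

```python
from statistics import mean


def filter_probesets(calls, control_prefix="AFFY_"):
    """Keep probesets that are not controls and are called present at least once.

    calls maps probeset ID -> list of per-sample detection calls ("P", "M" or "A").
    The default prefix follows the text; Affymetrix control probesets are
    commonly named with an "AFFX" prefix, so the prefix is left configurable.
    """
    return [pid for pid, sample_calls in calls.items()
            if not pid.startswith(control_prefix) and "P" in sample_calls]


def fit_centroids(X, y):
    """Stand-in classifier: one mean expression vector per class label."""
    return {label: [mean(col) for col in zip(*(r for r, l in zip(X, y) if l == label))]
            for label in set(y)}


def predict_centroid(model, row):
    """Assign the label of the nearest class centroid."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(model[label], row)))


def loocv_accuracy(X, y, fit=fit_centroids, predict=predict_centroid):
    """Leave-one-out cross validation: each case is predicted by a model
    trained on all remaining cases."""
    hits = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Hypothetical detection calls for four probesets across two samples.
calls = {"AFFY_ctrl_at": ["P", "P"], "1000_at": ["A", "A"],
         "1001_at": ["A", "P"], "1002_at": ["M", "A"]}
kept = filter_probesets(calls)

# Hypothetical one-gene data set with two well-separated outcome classes.
X = [[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]]
y = ["CR", "CR", "CR", "fail", "fail", "fail"]
accuracy = loocv_accuracy(X, y)
```

Passing `fit` and `predict` as arguments keeps the cross-validation loop independent of the learner, so a different supervised method could be substituted without changing the evaluation code.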
In order to identify potential clusters and inherent biologic groups, a large number of
clinical co-variables were correlated with the expression data using unsupervised
clustering methods such as hierarchical clustering, principal component analysis and a
force-directed clustering algorithm coupled with the Vxinsight visualization tool.
Agglomerative hierarchical clustering with average linkage (similar to Eisen et al.,
1998) was performed with respect to both genes and samples, using MATLAB
(The MathWorks, Inc.), the MatArray toolbox and the native MATLAB statistics toolbox.
The data for a given gene was first normalized by subtracting the mean expression
value computed across all patients, and dividing by the standard deviation across all
patients for each gene. The distance metric used was one minus Pearson's correlation
coefficient; this choice enabled subsequent direct comparison with the Vxinsight
cluster analysis, which is based on the t-statistic transformation of the correlation
coefficient (Davidson et al., 2001). The second clustering method was a particle-based
algorithm implemented within the Vxinsight knowledge visualization tool
(www.sandia.gov/projects/VxInsight.html). In this approach, a matrix of pair
similarities is first computed for all combinations of patient samples. The pair
similarities are given by the t-statistic transformation of the correlation coefficient
determined from the normalized expression signatures of the samples (Davidson et
al., 2001). The program then randomly assigns patient samples to locations (vertices)
on a 2D graph, and draws lines (edges), thus linking each sample pair, and assigning
each edge a weight corresponding to the pairwise t-statistic of the correlation. The
resulting 2D graph constitutes a candidate clustering. To determine the optimal
clustering, an iterative annealing procedure is followed, wherein a 'potential energy'
function that depends on edge distances and weights is minimized, following random
moves of the vertices (Davidson et al., 1998, 2001). Once the 2D graph has converged
to a minimum energy configuration, the clustering defined by the graph is visualized
as a 3D terrain map, where the vertical axis corresponds to the density of samples
located in a given 2D region. The resulting clusters are robust with respect to random
starting points and to the addition of noise to the similarity matrix, evaluated through
its effect on neighbor stability histograms (Davidson et al., 2001).
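The normalization and the two similarity measures named above can be sketched in a few lines. The text does not spell out the t-statistic transformation, so the standard form t = r * sqrt((n - 2) / (1 - r^2)) is assumed here. The use of the sample standard deviation and all function names are likewise assumptions.

```python
import math
from statistics import mean, stdev


def zscore(values):
    """Normalize one gene: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b):
    """Distance used for hierarchical clustering: one minus Pearson's r."""
    return 1.0 - pearson(a, b)


def t_statistic(r, n):
    """Assumed t-statistic transformation of a correlation r over n observations."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical expression signatures of two samples over five genes.
sample_1 = [1.0, 2.5, 0.3, 4.1, 3.3]
sample_2 = [0.8, 2.9, 0.1, 3.7, 3.9]
r = pearson(sample_1, sample_2)
similarity = t_statistic(r, len(sample_1))
```

Both measures are monotone in r, which is why a clustering based on one minus the correlation can be compared directly with one based on the t-statistic similarity.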
Table 45. Genes with differential expression patterns between Vxinsight cluster A
and the rest of the cases. The gene lists are sorted into decreasing order based on the
resulting F-scores.

Cluster A - Up-regulated genes

F score  p      Affymetrix   Gene description                            Gene
                number                                                   symbol
167.99   0.001  37746_r_at   Tumor suppressor gene                       TS5
124.38   0.005  36276_at     Contactin 2 axonal                          CNTN2
123.10   0.006  33058_at     Cytokeratin type II                         K6HF
122.51   0.010  33137_at     Transforming growth factor                  LTBP4
                             beta binding protein 4
119.66   0.004  721_g_at     Heat-shock transcription factor 4           HSF4
114.94   0.019  396_f_at     Erythropoietin receptor precursor           EPOR
114.21   0.011  41565_at     Ataxin 2 related protein                    A2LP
113.20   0.007  40792_s_at   Triple functional domain interacting        PTPRF
109.97   0.008  884_at       Integrin α3                                 ITGA3
98.55    0.010  40539_at     Myosin IXB                                  MYO9B
98.43    0.040  41694_at     Temperature sensitivity complementing       BHK21
94.32    0.020  41347_at     p70 ribosomal S6 kinase beta (Iroquois      IRX5
                             homeobox protein 5)
92.02    0.010  38132_at     Serum constituent protein                   MSE55
88.80    0.021  39448_r_at   B7 protein                                  B7
85.44    0.035  34573_at     Ephrin A3                                   EFNA3
84.99    0.020  34894_r_at   Protease serine 26                          PRSS22
82.83    0.029  39775_at     Complement component inhibitor 1            SERPING1
82.51    0.031  41499_at     v-ski avian sarcoma viral oncogene          SKI
80.85    0.010  567_s_at     Promyelocytic leukemia                      PML
77.97    0.020  38707_r_at   E2F transcription factor 4                  E2F4
76.97    0.044  37061_at     Chitotriosidase                             CHIT1
73.43    0.021  1804_at      Kallikrein 3 prostate specific antigen      KLK3
73.74    0.041  38058_at     Dermatopontin precursor                     DPT
72.07    0.023  39868_at     poly rC binding protein 3                   PCBP3
72.48    0.033  35910_f_at   Zinc finger protein 200                     MMPL
                             (matrix metalloproteinase like)
69.03    0.041  39920_r_at   C1q-related factor                          CRF
68.53    0.051  37140_s_at   Ectodermal dysplasia 1 anhidrotic           ED1
68.52    0.055  39306_at     Protease serine 16 thymus                   PRSS16
68.07    0.062  1925_at      Cyclin F                                    CCNF
67.57    0.093  40501_s_at   Myosin-binding protein C slow-type          MYBPC1
66.62    0.052  160020_at    Matrix metalloproteinase 14 preprotein      MMP14
63.85    0.043  33448_at     Hepatocyte growth factor activator          SPINT1
                             inhibitor precursor
62.14    0.035  33034_at     Rhomboid veinlet Drosophila like            RHBDL
61.86    0.055  31393_r_at   Undifferentiated embryonic cell             UTF1
                             transcription factor 1
61.28    0.039  41359_at     Plakophilin 3                               PKP3
60.51    0.103  538_at       CD34 antigen                                CD34


Table 45. Continuation. Genes with differential expression patterns between
Vxinsight cluster A and the rest of the cases.

Cluster A - Down-regulated genes

F score  p      Affymetrix   Gene description                            Gene
                number                                                   symbol
115.50   0.018  36991_at     Splicing factor arginine/serine-rich 4      SFRS4
114.41   0.015  1241_at      protein tyrosine phosphatase type           PTP4A
                             IVA member 2
108.68   0.013  41187_at     death-associated protein 6                  DAXX
98.82    0.018  37675_at     phosphate carrier precursor 1b              PHC
95.63    0.026  37029_at     ATP synthase H transporting                 ATP5O
                             mitochondrial F1 complex O subunit
95.11    0.019  41834_g_at   jumping translocation breakpoint            JTB
94.08    0.027  41295_at     GTT1 protein                                GTT1
92.64    0.027  1817_at      prefoldin 5                                 PFDN5
90.62    0.029  35279_at     Tax1 human T-cell leukemia virus            TAX1BP1
                             type I binding protein 1
90.18    0.027  32832_at     erythroblast macrophage attacher            No symbol
87.74    0.028  1357_at      ubiquitin specific protease                 USP4
                             proto-oncogene
87.26    0.047  1499_at      farnesyltransferase CAAX box alpha          FNTA
84.12    0.048  37766_s_at   proteasome prosome macropain 26S            PSMC5
                             subunit ATPase 5
83.23    0.056  1399_at      elongin C                                   TCEB1
82.82    0.042  41241_at     asparaginyl-tRNA synthetase                 NARS
78.67    0.030  36492_at     proteasome prosome macropain 26S            PSMD9
                             subunit non-ATPase 9
78.21    0.043  37581_at     protein phosphatase 6 catalytic subunit     PPP6C
78.18    0.082  39360_at     sorting nexin 3                             No symbol
76.07    0.054  36616_at     DAZ associated protein 2                    No symbol
75.21    0.063  34330_at     cytochrome c oxidase subunit VIIa           COX7A2L
                             polypeptide 2 like
74.72    0.044  31670_s_at   calcium/calmodulin-dependent protein        CAMKG
                             kinase CaM kinase II gamma
74.30    0.045  39184_at     elongin B                                   TCEB2
73.46    0.055  34302_at     eukaryotic translation initiation factor 3  EIF3S4
                             subunit 4 delta 44kD
72.24    0.074  35298_at     eukaryotic translation initiation factor 3  EIF3S7
                             subunit 7 zeta 66/67kD
71.36    0.055  41551_at     similar to S. cerevisiae RER1               No symbol
71.28    0.057  35297_at     NADH dehydrogenase ubiquinone               NDUFAB1
                             1 alpha/beta subcomplex 1 8kD SDAP
71.06    0.059  40874_at     endothelial differentiation-related 1       EDF1
70.73    0.045  38455_at     small nuclear ribonucleoprotein             SNRPB
                             polypeptides B and B1
69.57    0.082  935_at       adenylyl cyclase-associated protein         No symbol
69.09    0.077  31492_at     muscle specific gene                        No symbol
68.81    0.043  37672_at     ubiquitin specific protease 7 herpes        USP7
                             virus-associated
68.31    0.066  35319_at     CCCTC-binding factor zinc finger            CTCF
                             protein
Table 45. Continuation. Genes with differential expression patterns between
Vxinsight cluster B and the rest of the cases.

Cluster B - Up-regulated genes

F score  p      Affymetrix   Gene description                            Gene
                number                                                   symbol
250.55   0.001  40103_at     Villin 2                                    VIL2
157.12   0.003  1096_g_at    CD19 antigen                                CD19
122.41   0.005  38269_at     Protein kinase D2                           PKD2
113.79   0.005  2047_s_at    Junction plakoglobin isoform 1              JUP
113.35   0.006  35298_at     Eukaryotic translation initiation factor 3  EIF3
109.78   0.010  36991_at     Splicing factor arg/ser rich 4              SFRS4
107.87   0.011  854_at       B lymphoid tyrosine kinase                  BLK
105.40   0.005  41356_at     B-cell CLL/lymphoma 11A                     BCL11A
101.07   0.006  38017_at     CD79A antigen                               CD79A
91.63    0.010  37672_at     Ubiquitin specific protease 7 herpes        USP7
                             virus associated
91.08    0.020  37585_at     Small nuclear ribonucleotide                SNRPA1
                             polypeptide A
89.36    0.023  31492_at     Muscle specific gene                        M9
87.23    0.008  36111_s_at   Splicing factor arg/ser rich 2              SFRS2
85.38    0.041  1754_at      Death associated protein                    DAXX
81.74    0.039  1357_at      Ubiquitin specific protease                 USP
                             proto-oncogene
74.04    0.047  41834_g_at   Jumping translocation breakpoint            JTB
73.16    0.020  39044_s_at   Diacylglycerol kinase delta                 DGKD
73.14    0.013  38604_at     Neuropeptide Y                              NPY
71.06    0.010  32238_at     Binding integrator 1                        BIN1
70.78    0.031  38054_at     Hepatitis B virus interacting x-protein     HBXIP
68.13    0.050  1817_at      Prefoldin 5                                 PFDN5
67.74    0.018  32842_at     B-cell CLL/lymphoma                         BCL2
63.71    0.069  40189_at     SET translocation myeloid-leukemia          SET
                             associated
61.60    0.015  33304_at     Interferon stimulated gene 20kD             ISG20
59.35    0.025  38989_at     DC12 protein                                DC12
57.53    0.045  36630_at     Delta sleep inducing peptide                DSIPI
56.43    0.035  36949_at     Casein kinase 1 delta                       CSNK1D
56.22    0.027  1814_at      Transforming growth factor beta             TGFBR2
                             receptor
56.07    0.031  39318_at     T-cell lymphoma-1                           TCL1A
54.40    0.037  37028_at     DNA damage inducible                        PPP1R15A
53.94    0.021  1102_s_at    Nuclear receptor subfamily 3 group C        NR3C1
51.74    0.033  40828_at     PAK-interacting exchange factor beta        ARHGEF7
51.32    0.025  493_at       Casein kinase 1 delta                       CSNK1D
50.93    0.039  40365_at     Guanine nucleotide binding protein G        GNA15
50.77    0.037  32070_at     Tyrosine phosphatase receptor type          PTPRCAP
50.59    0.054  35974_at     Lymphoid-restricted membrane protein        LRMP
50.37    0.048  34180_at     Rho guanine nucleotide exchange factor      GEF10
50.06    0.031  280_g_at     Nuclear receptor subfamily 4 group A1       NR4A1
48.15    0.017  41203_at     Zinc finger protein 162 (splice factor 1)   SF1
47.98    0.030  40841_at     Transforming acidic coiled-coil             TACC1

Table 45. Continuation. Genes with differential expression patterns between
Vxinsight cluster B and the rest of the cases.

Cluster B - Down-regulated genes

F score  p      Affymetrix   Gene description                            Gene
                number                                                   symbol
81.4     0.007  39689_at     cystatin C amyloid angiopathy               CST3
78.48    0.004  36938_at     N-acylsphingosine amidohydrolase            ASAH
                             acid ceramidase
67       0.011  1230_g_at    cisplatin resistance associated             No symbol
57.88    0.022  34885_at     synaptogyrin 2                              SYNGR2
57.26    0.018  35367_at     lectin galactoside-binding soluble 3        LGALS3
                             galectin 3
54.71    0.015  36766_at     ribonuclease RNase A family 2 liver         RNASE2
                             eosinophil-derived neurotoxin
52.66    0.029  32747_at     aldehyde dehydrogenase 2 family             ALDH2
                             mitochondrial
51.51    0.022  36879_at     endothelial cell growth factor 1            ECGF1
                             platelet-derived
51.32    0.021  39994_at     chemokine C-C motif receptor 1              CCR1
50.88    0.014  35012_at     myeloid cell nuclear differentiation        MNDA
                             antigen
50.53    0.02   36889_at     Fc fragment of IgE high affinity 1          FCER1G
                             receptor for gamma polypeptide
                             precursor
50.41    0.023  34789_at     serine or cysteine proteinase inhibitor     PIR6
                             clade B ovalbumin member 6
50.21    0.029  1052_s_at    CCAAT/enhancer binding protein              CEBPD
                             C/EBP delta
49.91    0.014  37398_at     platelet/endothelial cell adhesion          CD31
                             molecule CD31 antigen
49.79    0.022  40580_r_at   parathymosin                                PTMS
47.39    0.03   41096_at     S100 calcium-binding protein A8             S100A8
47.26    0.031  33963_at     azurocidin 1 cationic antimicrobial         No symbol
                             protein 37
47.06    0.018  36465_at     interferon regulatory factor 5              No symbol
46.95    0.03   37021_at     cathepsin H                                 CTSH
46.36    0.029  35926_s_at   leukocyte immunoglobulin-like receptor      No symbol
                             subfamily B with TM and ITIM domains
46.02    0.02   41523_at     RAB32 member RAS oncogene family            RAB32
45.94    0.034  38363_at     TYRO protein tyrosine kinase binding        TYROBP
                             protein
44.74    0.032  33856_at     CAAX box 1                                  CXX1
44.73    0.038  40282_s_at   adipsin/complement factor D precursor       DF
44.5     0.027  32451_at     membrane-spanning 4-domains                 No symbol
                             subfamily A member 3 hematopoietic
                             cell-specific
44.08    0.045  38631_at     tumor necrosis factor alpha-induced         TNFAIP2
                             protein 2
44.01    0.053  40762_g_at   solute carrier family 16 monocarboxylic     SLC16A5
                             acid transporters member 5

Table 45. Continuation. Genes with differential expression patterns between
Vxinsight cluster C and the rest of the cases.

Cluster C - Up-regulated genes

F score  p      Affymetrix   Gene description                            Gene
                number                                                   symbol
284.97   0.001  36938_at     N-acylsphingosine amidohydrolase acid       ASAH
                             ceramidase
132.03   0.001  39689_at     Cystatin C                                  CST3
126.67   0.013  1637_at      Mitogen-activated protein kinase-           MAPKAPK3
                             activated protein kinase 3
114.85   0.010  38363_at     Tyro protein tyrosine kinase binding        TYROBP
                             protein
104.53   0.009  35297_at     NADH dehydrogenase ubiquinone 1             NDUFAB1
100.84   0.008  1230_g_at    Cisplatin resistance associated
93.33    0.008  36879_at     Endothelial cell growth factor 1 -          ECGF1
                             platelet derived
90.92    0.009  3856_at      Farnesyltransferase CAAX box alpha          FNTA
89.47    0.017  35279_at     Tax1 human T-cell leukemia virus type 1     TAX1BP1
                             binding protein 1
88.39    0.047  39160_at     Pyruvate dehydrogenase lipoamide beta       PDHB
84.75    0.036  41187_at     Death-associated protein 6                  DAP6
84.18    0.029  41495_at     GTT1 protein                                GTT1
81.31    0.006  41523_at     RAB32 member RAS oncogene family            RAB32
80.08    0.048  37337_at     Small nuclear ribonucleoprotein G           SNRPG
75.51    0.038  402_s_at     Intercellular adhesion molecule             ICAM3
74.82    0.014  40282_s_at   Adipsin/complement factor D                 DF
72.20    0.050  39360_at     Sorting nexin 3                             SNX3
70.26    0.055  37726_at     Mitochondrial ribosomal protein L3          MRPL3
69.05    0.016  39581_at     Cystatin A (stefin A)                       CSTA
68.66    0.035  1817_at      Prefoldin 5                                 PFDN5
67.80    0.059  36620_at     Superoxide dismutase 1 soluble              SOD1
66.34    0.090  37670_at     Annexin VII                                 ANXA7
65.36    0.065  38097_at     Etoposide-induced mRNA                      PIG8
65.07    0.092  824_at       Glutathione-S-transferase like              GSTTLp28
64.88    0.016  39593_at     Similar to fibrinogen-like 2, clone
                             MGC:22391, mRNA, complete cds
63.75    0.024  35012_at     Myeloid cell nuclear differentiation        MNDA
63.30    0.047  1399_at      Elongin C                                   TCEB1
62.02    0.079  891_at       YY1 transcription factor                    YY1
61.60    0.079  38992_at     DEK oncogene DNA binding                    DEK
54.78    0.036  37021_at     Cathepsin H                                 CTSH
54.28    0.029  41198_at     Granulin                                    GRN
54.27    0.028  38631_at     Tumor necrosis factor alpha-induced         TNFAIP2
                             protein 2
54.26    0.032  34860_g_at   Melanoma antigen, family D, 2               MAGED2
52.80    0.037  1693_s_at    Tissue inhibitor of metalloprotease 1       TIMP1
48.83    0.031  38533_s_at   Integrin alpha M precursor                  ITGAM
48.64    0.038  36709_at     Integrin alpha X precursor                  ITGAX
48.37    0.021  34885_at     Synaptogyrin 2                              SYNGR2
Table 45. Continuation. Genes with differential expression patterns between
Vxinsight cluster C and the rest of the cases.

Cluster C - Down-regulated genes

F score  p      Affymetrix   Gene description                            Gene
                number                                                   symbol
105.94   0.006  1096_s_at    CD19 antigen                                CD19
103.5    0.005  40103_at     villin 2                                    VIL2
80.41    0.009  2047_s_at    junction plakoglobin isoform 1              JUP
80.14    0.013  38017_at     CD79A antigen isoform 2 precursor           CD79A
77.12    0.025  39327_at     p53-responsive gene                         PRG2
72.29    0.017  38269_at     protein kinase D2                           PKD2
72.15    0.011  39318_at     T-cell lymphoma-1                           TCL1A
66.16    0.022  854_at       B lymphoid tyrosine kinase                  BLK
64.49    0.019  32238_at     bridging integrator 1                       BIN1
61.79    0.028  38604_at     neuropeptide Y                              NPY
57.28    0.049  41356_at     hypothetical protein FLJ10173               FLJ10173
56.67    0.028  41165_g_at   Immunoglobulin mu                           IGHM
56.67    0.028  41165_g_at   B-cell CLL/lymphoma 11A zinc finger         BCL11A
                             protein
55.58    0.038  32842_at     B-cell CLL/lymphoma 7A                      BCL7A
52.05    0.025  493_at       casein kinase 1 delta                       CSNK1D
49.7     0.03   36933_at     N-myc downstream regulated                  NDRG1
48.04    0.025  38018_g_at   CD79A antigen isoform 2 precursor           CD79A
47.31    0.049  41151_at     SKIP for skeletal muscle and kidney         SKIP
                             enriched inositol phosphatase
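The gene lists in Table 45 are ranked by F-score. The text does not define the statistic. The sketch below assumes the ordinary one-way ANOVA F statistic comparing the cases inside a cluster with all remaining cases, and the function names and example values are hypothetical.

```python
from statistics import mean


def f_score(in_cluster, rest):
    """One-way ANOVA F statistic for two groups (cluster members vs. other cases)."""
    n1, n2 = len(in_cluster), len(rest)
    m1, m2 = mean(in_cluster), mean(rest)
    grand = (sum(in_cluster) + sum(rest)) / (n1 + n2)
    # Between-group sum of squares (1 degree of freedom for two groups).
    between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    # Within-group sum of squares (n1 + n2 - 2 degrees of freedom).
    within = (sum((v - m1) ** 2 for v in in_cluster)
              + sum((v - m2) ** 2 for v in rest))
    if within == 0:
        return float("inf")
    return between / (within / (n1 + n2 - 2))


def rank_genes(expression, in_cluster_mask):
    """Return (gene ID, F score) pairs sorted by decreasing F score.

    expression maps gene ID -> per-sample values; in_cluster_mask flags,
    per sample, whether the case belongs to the cluster of interest.
    """
    scores = {}
    for gene, values in expression.items():
        inside = [v for v, flag in zip(values, in_cluster_mask) if flag]
        outside = [v for v, flag in zip(values, in_cluster_mask) if not flag]
        scores[gene] = f_score(inside, outside)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values: gene_a separates the cluster from the rest, gene_b does not.
expression = {"gene_b": [1.0, 2.0, 3.0, 1.0, 2.0, 3.5],
              "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
mask = [True, True, True, False, False, False]
ranking = rank_genes(expression, mask)
```

Up-regulated and down-regulated lists, as in Table 45, would additionally split the ranked genes by the sign of the difference between the in-cluster mean and the mean of the remaining cases.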
Table 47. Discriminating genes that distinguish between remission and failure
overall, derived from SVM analysis.

     Affymetrix   Gene description                                Gene     Locus
     number                                                       symbol
1    41165_g_at   immunoglobulin heavy constant mu                IGHM     14q32.33
1    39389_at     CD9 antigen (p24)                               CD9      12p13
2    41058_g_at   uncharacterized hypothalamus protein HT012      HT012    6p22.2
3    31459_i_at   immunoglobulin lambda locus                     IGL      22q11.1
4    38389_at     2',5'-oligoadenylate synthetase 1 (40-46 kD)    OAS1     12q24.1
5    37504_at     E3 ubiquitin ligase SMURF1                      SMURF1   7q21.1
6    40367_at     bone morphogenetic protein 2                    BMP2     20p12
7    32637_r_at   PI-3-kinase-related kinase SMG-1                SMG1     16p12.3
8    39931_at     dual-specificity tyrosine-(Y)-phosphorylation   DYRK3    1q32
                  regulated kinase 3
9    37054_at     bactericidal/permeability-increasing protein    BPI      20q11
10   1404_r_at    small inducible cytokine A5 (RANTES)            SCYA5    17q11.2
11   1292_at      dual specificity phosphatase 2                  DUSP2    2q11
12   37709_at     DNA segment, numerous copies                    DXF68    Xp22.32
13   36857_at     RAD1 (S. pombe) homolog                         RAD1     5p13.2
14   41196_at     karyopherin (importin) beta 1                   KPNB1    17q21
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15 
16 
17 



18 
19 
20 
21 
22 
23 
24 
25 
26 
27 
28 

29 
30 
31 
32 



1182_at 

2q33 

34961_at 

3q13.13 

37862_at 

1p31 



38772_at 
1p31 

33208_at 

13q32 

37837_at 

18q23 

34031J_at 

7q21 

38220_at 

1p22 

34684_at 

12p12 

39449_at 

5p13 

32638_s_at 

16p12.3 

35957_at 

16p13 

34363_at 

5q31 

35431 _g_at 
14q24.1 

3501 2_at 
1q22 

38432_at 

1p36.33 

35664_at 

4q22 

41862_at 

11q25 



phospholipase C, epsilon PLCE 

T cell activation, increased late expression TACTILE 

dihydrolipoamide branched chain transacylase DBT 

(E2 component of branched chain keto acid 
dehydrogenase complex; maple syrup disease) 

cysteine-rich, angiogenic inducer, 61 CYR61 

DnaJ (Hsp40) homolog, subfamily C. member 3 DNAJC3 

KIAA0863 protein KIAA0863 

cerebral cavernous malformations 1 CCM1 

dihydropyrimidine dehydrogenase DPYD 

RecQ protein-like (DNA heiicase Ql-tike) RECQL 

S-phase kinase-associated protein 2 (p45) SKP2 

PI-3-kinase-related kinase SMG-1 SMG1 

stannin SNN 

selenoprotein P, plasma, 1 SEPP1 

RNA polymerase II transcriptional regulation MED6 

mediator (MedG, S. cerevisiae, homolog of) 

myeloid cell nuclear differentiation antigen MNDA 

interferon-stimulated protein, 15 kDa ISG15 

multimerin MMRN 

KIAA0056 protein KIAA0056 
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33 
34 
35 
36 
37 



33210_at 
14q 

35794_at 

Spter 

36108_at 

6p21.3 

3561 4_at 

20q13.3 

32089_at 

10p12 



YY1 transcription factor 



KIAA0942 protein 



HLA. class II, DQ be\a 1 



YY1 

KIAA0942 
DQB1 



transcription factor-like 5 (basic helix-loop-helix) TCFL5 



sperm associated antigen 6 



SPAG6 



Table 47. (Continuation). Discriminating genes that distinguish between
remission and fail overall derived from SVM analysis.

No.  Affymetrix number  Gene symbol  Locus     Gene description
38   1343_s_at          SERPINB      18q21.3   serine (or cysteine) proteinase inhibitor
39   665_at             STK2         3p21.1    serine/threonine kinase 2
40   40901_at           GS2NA        14q13     nuclear autoantigen
41   39299_at           KIAA0971     2q34      KIAA0971 protein
42   34446_at           KIAA0471     1q24      KIAA0471 gene product
43   33956_at           MD-2         8q13.3    MD-2 protein
44   37184_at           STX1A        7q11.23   syntaxin 1A (brain)
45   1773_at            FNTB         14q23     farnesyltransferase, CAAX box, beta
46   34731_at           KIAA0185     10q24.32  KIAA0185 protein
47   41700_at           F2R          5q13      coagulation factor II (thrombin) receptor
48   38407_r_at         PTGDS        9q34.2    prostaglandin D2 synthase (21 kD, brain)
49   40088_at           NRIP1        21q11.2   nuclear receptor interacting protein 1
50   33124_at           VRK2         2p16      vaccinia related kinase 2
51   32964_at           EMR1         19p13.3   egf-like module containing, mucin-like, hormone receptor-like sequence 1
52   39560_at           CBX6         22q13.1   chromobox homolog 6
53   39838_at           CLASP1       2q14.2    CLIP-associating protein 1
54   40166_at           LOC55884     -         CS box-containing WD protein
55   36927_at           GS3686       1p22.3    hypothetical protein, expressed in osteoblast
56   41393_at           ZNF195       11p15.5   zinc finger protein 195
57   35041_at           NTF3         12p13     neurotrophin 3
58   40238_at           GPRC5B       16p12     G protein-coupled receptor, family C, group 5
59   39926_at           MADH5        5q31      MAD (mothers against decapentaplegic, Drosoph)
60   36674_at           SCYA4        17q21     small inducible cytokine A4
61   32132_at           KIAA0675     3q13.13   KIAA0675 gene product
62   38252_s_at         AGL          1p21      1,6-glucosidase, 4-alpha-glucanotransferase
63   33598_r_at         CIAS1        1q44      cold autoinflammatory syndrome 1
64   37409_at           SRPK2        7q22      SFRS protein kinase 2
65   41019_at           PDCL         9q12      phosducin-like
66   1113_at            BMP2         20p12     bone morphogenetic protein 2
67   37208_at           PSPHL        7q11.2    phosphoserine phosphatase-like
68   32822_at           SLC25A4      4q35      solute carrier family 25
69   32249_at           HFL1         1q32      H factor (complement)-like 1
70   39600_at           -            -         EST
71   32648_at           DLK1         14q32     delta-like homolog (Drosophila)
72   39269_at           RFC3         13q12.3   replication factor C (activator 1) 3 (38kD)
73   37724_at           MYC          8q24.12   v-myc avian myelocytomatosis viral oncogene
74   35606_at           HDC          15q21     histidine decarboxylase
75   31926_at           CYP7A1       8q11      cytochrome P450, subfamily VIIA
76   32142_at           STK3         8p22      serine/threonine kinase 3 (Ste20, yeast homolog)
77   32789_at           NCBP2        3q29      nuclear cap binding protein subunit 2, 20kD
78   37279_at           GEM          8q13      GTP-binding protein (skeletal muscle)
79   40246_at           DLG1         3q29      discs, large (Drosophila) homolog 1
80   37547_at           B1           7p14      PTH-responsive osteosarcoma B1 protein
81   32298_at           ADAM2        8p11.2    a disintegrin and metalloproteinase domain 2
82   40496_at           C1S          12p13     complement component 1, s subcomponent
83   39032_at           TSC22        13q14     transforming growth factor beta-stimulated protein
SUPPLEMENTARY INFORMATION

Sample management

Cell suspensions from diagnostic bone marrow aspirates or peripheral
blood samples were handled according to the cryopreservation procedure of
St. Jude's Children's Hospital. Samples were retrieved from cryopreservation at
-135°C, thawed quickly at 37°C, and then washed by centrifugation at 1200
rpm for 5 minutes in warmed 20% (v/v) Fetal Bovine Serum in Dulbecco's
Modified Minimum Essential Medium (Invitrogen, Grand Island, NY).
Cytospins were prepared from thawed samples, stained with Wright's stain and
assessed for percent blasts and cell viability by light microscopy. Decanted cell
pellets were used immediately for RNA purification.

RNA extraction and T7 amplification 

An average of 2 x 10^ cells was used for total RNA extraction with the
Qiagen RNeasy mini kit (VWR International AB, Stockholm, Sweden). The
mean concentration of the purified total RNA was 0.5 µg/µl (approximately
25 µg of total RNA yield), as quantified with the RiboGreen assay (Molecular
Probes, Eugene, OR). All samples met assay quality standards as recommended
by Affymetrix. The A260nm/A280nm ratio was determined
spectrophotometrically in 10 mM Tris, pH 8.0, 1 mM EDTA, and all samples
used for array analysis exceeded values of 1.8. The RNA integrity was
analyzed by electrophoresis using the RNA 6000 Nano Assay run in the
Lab-on-a-Chip (Agilent Technologies, Palo Alto, CA). Criteria for high-quality
RNA included a 28S rRNA/18S rRNA peak area ratio >1.5 and the absence
of DNA contamination. To prepare cRNA target, the mRNA was reverse
transcribed into cDNA, followed by re-transcription in a method that uses two
rounds of amplification devised for small starting RNA samples, kindly
provided by Ihor Lemischka (Princeton University), with the following
modifications: linear acrylamide (10 µg/ml, Ambion, Austin, TX) was used as a
co-precipitant in steps that used alcohol precipitation, and the starting amount of
RNA was 2.5 µg of total RNA. Briefly, a T7-(dT)24 oligonucleotide primer
(Genset Oligos, La Jolla, CA) was annealed to 2.5 µg of total RNA and reverse
transcribed with Superscript II (Invitrogen, Grand Island, NY) at 42°C for 60
min. Second strand cDNA synthesis by DNA polymerase I (Invitrogen) at 16°C
for 120 min was followed by extraction with phenol:chloroform:isoamyl
alcohol (25:24:1) (Sigma, St. Louis, MO) and microconcentration (Microcon 50,
Millipore, Bedford, MA). RNA was then transcribed from the cDNA with a
high yield T7 RNA polymerase kit (Megascript, Ambion, Austin, TX). The
second round of amplification utilized random hexamer and T7-(dT)24
oligonucleotide primers, Superscript II, DNA polymerase I and a biotin-labeling
high yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, NY). The
biotin-labeled cRNA was purified on RNeasy mini kit columns, eluted with
50 µl of 45°C RNase-free water and quantified using the RiboGreen assay.

Target labeling and probe hybridization 

Following a quality check on the Agilent Lab-on-a-Chip, 15 µg of cRNA was
fragmented for 35 minutes in 200 mM Tris-acetate pH 8.1, 150 mM MgOAc and
500 mM KOAc following the Affymetrix protocol (Affymetrix, Santa Clara,
CA). The fragmented RNA was then hybridized for 20 hours at 45°C to
HG_U95Av2 probes. The hybridized probe arrays were washed and stained
with the EukGE-WS2 fluidics protocol (Affymetrix), including streptavidin
phycoerythrin conjugate (SAPE, Molecular Probes, Eugene, OR) and an
antibody amplification step (Anti-streptavidin, biotinylated, Vector Labs,
Burlingame, CA). HG_U95Av2 chips were scanned at 488 nm, as
recommended by Affymetrix. The images were inspected to detect artifacts.
The expression value of each gene was calculated using Affymetrix
GENECHIP software for the 12,625 Open Reading Frames on the probe set.

Data presentation and exclusion criteria 

Criteria used as quality control for exclusion of poor sample arrays
included: total RNA integrity, cRNA quality, probe array image inspection, B2
oligo staining (used for array grid alignment), and internal control genes
(GAPDH value greater than 1800). Of the 142 cases initially selected, 126 were
ultimately retained in the study; 16 cases were excluded from the final analysis
due to poor quality total RNA or cRNA amplification or a poor hybridization
(low percentage of expressed genes <10%, poor 3'/5' amplification ratios).

Data Analysis

1. Data preprocessing

The preprocessing stage was divided into filtering and transformation. For
filtering, the control probesets were removed (i.e. probesets whose accession ID
starts with the AFFX prefix), as well as all probesets that had at least one
"absent" call (as determined by the Affymetrix MAS 5.0 statistical software)
across all training set samples. In the transformation stage, the natural logarithm
of the gene expression values (i.e. the signal values) was taken. This
preprocessing was used for most of the analysis methods, except those for
which different preprocessing is described in the detailed information below.
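
The filtering and transformation steps above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `preprocess`, the one-letter call codes ("P", "M", "A") and the toy signal values are assumptions made for the example, and the absent-call rule follows the wording of the text literally (a probeset is dropped if any training sample has an "absent" call).

```python
import math

def preprocess(probe_ids, signals, calls):
    """Filter probesets, then natural-log transform the surviving signals.

    probe_ids: list of probeset IDs
    signals:   signals[i][j] = signal of probeset i in training sample j
    calls:     calls[i][j]   = detection call, assumed coded "P", "M" or "A"
    """
    kept_ids, kept_logs = [], []
    for pid, row, row_calls in zip(probe_ids, signals, calls):
        if pid.startswith("AFFX"):      # control probeset
            continue
        if "A" in row_calls:            # at least one "absent" call
            continue
        kept_ids.append(pid)
        kept_logs.append([math.log(v) for v in row])
    return kept_ids, kept_logs

# Toy input (hypothetical values): one control, one clean probeset,
# one probeset with an absent call.
ids, logs = preprocess(
    ["AFFX-BioB-5_at", "1096_s_at", "40103_at"],
    [[120.0, 95.0], [800.0, 650.0], [30.0, 410.0]],
    [["P", "P"], ["P", "P"], ["A", "P"]],
)
```

Only the second probeset survives in this toy input; its two signals are returned on the natural-log scale.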

2. Description of the supervised learning methods for class prediction

The exploratory evaluation of our data set was performed in several steps. The
first step was the construction of predictive classification algorithms that linked
gene expression data to patient outcome as well as the traditional clinical
variables that define prognosis. Using prior knowledge of the clinical labels
(leukemia lineage, cytogenetics and outcome), the 126 patients were divided into
statistically balanced and representative training (82 patients) and test (44
patients) sets. For classification purposes, several primary supervised
approaches were used, including Bayesian networks, recursive feature
elimination in the context of Support Vector Machines (SVM-RFE), linear
discriminant analysis and fuzzy logic. Classification tasks were as follows:

- ALL vs. AML
- Remission vs. Fail
- t(4;11) vs. not t(4;11)
- MLL vs. Not MLL
- Remission vs. Fail in ALL
- Remission vs. Fail in AML
- Remission vs. Fail in VxInsight cluster A
- Remission vs. Fail in VxInsight cluster B
- Remission vs. Fail in VxInsight cluster C
- MLL vs. Not MLL in ALL
- MLL vs. Not MLL in AML
- Remission vs. Fail in MLL
- Remission vs. Fail in Not MLL
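
One way to obtain a training/test division that is balanced with respect to the clinical labels is a stratified random split, sketched below. The text does not say how the division was actually made, so this is an assumed procedure; the function name `stratified_split`, the combined label strings and the per-class counts are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices so each label keeps roughly the same
    proportion in the training and test sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    train, test = [], []
    for lab in sorted(by_label):
        members = by_label[lab]
        rng.shuffle(members)
        n_train = round(len(members) * train_fraction)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)

# Hypothetical label counts summing to 126 patients.
labels = (["ALL/Remission"] * 40 + ["ALL/Fail"] * 30
          + ["AML/Remission"] * 26 + ["AML/Fail"] * 30)
train_idx, test_idx = stratified_split(labels, 82 / 126)
```

Because rounding is applied per label, the training set size is only approximately 82 for an arbitrary label distribution.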

2.1. Bayesian Networks 
We employed the Bayesian network framework described in (6), without
any data preprocessing. The Bayesian network modeling and learning paradigm
was introduced in Pearl (1988) and Heckerman et al. (1995) (7, 8) and has been
studied extensively in the statistical machine learning literature. Our work
tailors this paradigm to the analysis of gene expression data in general and to
the classification problem in particular. A Bayesian net is a graph-based model
for representing probabilistic relationships between random variables. The
random variables, which may, for example, represent gene expression levels,
are modeled as graph nodes; probabilistic relationships are captured by directed
edges between the nodes and conditional probability distributions associated
with the nodes. A Bayesian net asserts that each node is statistically
independent of all its non-descendants, once the values of its parents (immediate
ancestors) in the graph are known. That is, a node n's parents render n and its
non-descendants conditionally independent. In our modeling, we consider
Bayesian nets in which each gene is a node, and the class label of interest is an
additional node C having no children. The conditional independence assertion
associated with (leaf) node C implies that the classification of a case q depends
only on the expression levels of the genes that are C's parents in the net.
More formally, distribution Pr{q[C] | q[genes]} is identical to distribution
Pr{q[C] | q[Par(C)]}, where Par(C) denotes the parent set of C. Note, in
particular, that the classification does not depend on other aspects (other than
the parent set of C) of the graph structure of the Bayesian net. Thus, while the
Bayesian network model ultimately can be a highly appropriate tool for learning
global gene regulatory networks, in the context of classification tasks such as
those considered in this paper, the Bayesian network learning problem may be
reduced to the problem of learning subnetworks consisting only of the class
label and its parents. It is important to emphasize how this modeling differs
from that of a naive Bayesian classifier (9, 10) and from the generalization
described in (11). A naive Bayesian classifier assumes independence of the
attributes (genes), given the value of the class label. Under this assumption, the
conditional probability Pr{q[C] | q[genes]} can be computed from the product
$\prod_{g_i \in genes}$ Pr{q[g_i] | q[C]} of the marginal conditional probabilities. The
naive Bayesian model is equivalent to a Bayesian net in which no edges exist
between the genes, and in which an edge exists between every gene and the
class label. We make neither assumption. Rather, we ignore the issue of what
edges may exist between the genes, and compute Pr{q[C] | q[genes]} as
Pr{q[C] | q[Par(C)]}, an equivalence that is valid regardless of what edges exist
between the genes, provided only that Par(C) is a set of genes sufficient to
render the class label conditionally independent of the remaining genes.
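
As a compact restatement of the two models contrasted above, the following sketch writes both in one notation. The class prior Pr{C = c} and the proportionality sign are standard Bayes-rule bookkeeping that the text leaves implicit; they are assumptions of this sketch, as is the use of c for a particular class value.

```latex
% Bayesian-net classifier used here: only the parent set of C matters
\Pr\{q[C] \mid q[\mathrm{genes}]\} = \Pr\{q[C] \mid q[\mathrm{Par}(C)]\}

% Naive Bayesian classifier: genes assumed independent given the class
\Pr\{q[C] = c \mid q[\mathrm{genes}]\} \;\propto\;
  \Pr\{C = c\} \prod_{g_i \in \mathrm{genes}} \Pr\{q[g_i] \mid q[C] = c\}
```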

Friedman et al. (1997) (11) drop the independence assumption of a naive
Bayesian classifier and attempt to learn edges between the attributes (genes, in
our context), while maintaining an edge from the class label into each attribute.
This approach yields good improvements over naive Bayesian classifiers in the
experiments (application domains other than gene expression data) reported in
Friedman et al. (1997) (11).

Our approach exploits a prior belief (supported by
experimental results reported in (6) and in other gene expression analyses) that
for the gene expression application domain, only a small number of genes is
necessary to render the class label (practically) conditionally independent of the
remaining genes. This both makes learning parent sets Par(C) tractable, and
generally allows the quantity Pr{q[C] | q[Par(C)]} to be well estimated from a
training sample. Even with the focus on restricted subnetworks, the learning
problem is enormously difficult. Given a collection of training cases, we must
learn one or more "plausible" Bayesian subnetworks, each consisting of class
label node C and its parent set Par(C). The main factors contributing to the
difficulty of this learning problem are the large number of genes, the fact that the
expression values of the genes are continuous, and the fact that expression data
generally are rather noisy.

The approach to Bayesian network learning employed
here identifies parent sets which are supported by current evidence by
employing an external gene selection algorithm which produces between 20 and
30 genes using a measure of class separation quality similar to the TNoM score
described in (12, 13). A binary binning of each selected gene's expression value
about a point of maximal class separation also is performed. The set of selected
genes then is searched exhaustively for parent sets of size 5 or less, with the
induced candidate networks being evaluated by the BD scoring metric (8). This
metric, along with a variance factor, is used to blend the predictions made by
the 500 best scoring networks (6). Each of these 500 Bayesian networks can be
viewed as a competing hypothesis for explaining the current evidence (i.e.,
training data and simple priors) for the corresponding classification task, and the
gene interactions each suggests are potentially of independent interest as well.

Another significant aspect of our method involves a distinct normalization of
the gene expression data for each classification task. We have found this a
necessary follow-up step to the standard Affymetrix scaling algorithm. Our
approach to normalization is to consider, for each case, the average expression
value over some designated set of genes, and to scale each case so that this
average value is the same for all cases. This approach allows the analysis to
concentrate on relative gene expression values within a case by standardizing a
reference point between cases. The designated reference genes for a given
classification task are selected based on poorest class separation quality, which
is a heuristic for identifying reference genes likely to be independent of the
class label.
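
Two of the steps described above, binary binning about a point of maximal class separation and per-case normalization against reference genes, can be illustrated with a small sketch. This is an assumed reading, not the implementation referenced in (6): the separation measure below simply counts misclassifications at a threshold (in the spirit of a TNoM-like score), and the function names, the 0/1 label coding and the toy numbers are hypothetical.

```python
def best_split(values, labels):
    """Find the threshold that misclassifies the fewest cases when each
    side of the threshold is assigned to one class (labels are 0 or 1)."""
    order = sorted(set(values))
    # Candidate cut points: midpoints between consecutive distinct values.
    cuts = [(a + b) / 2 for a, b in zip(order, order[1:])]
    best_cut, best_err = None, len(values) + 1
    for cut in cuts:
        # Errors if "above the cut" is read as class 1 ...
        err_hi = sum((v > cut) != (lab == 1) for v, lab in zip(values, labels))
        # ... or with the opposite orientation.
        err = min(err_hi, len(values) - err_hi)
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut, best_err

def binarize(values, cut):
    """Binary binning of one gene's expression values about the cut."""
    return [1 if v > cut else 0 for v in values]

def normalize_case(case, reference_idx, target_mean):
    """Scale one case so the mean over the reference genes equals target_mean."""
    ref_mean = sum(case[i] for i in reference_idx) / len(reference_idx)
    scale = target_mean / ref_mean
    return [v * scale for v in case]

# Toy gene that separates the two classes perfectly.
values = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]
labels = [0, 0, 0, 1, 1, 1]
cut, err = best_split(values, labels)
bins = binarize(values, cut)

# Toy case with three genes; genes 0 and 1 serve as reference genes.
scaled = normalize_case([2.0, 4.0, 6.0], reference_idx=[0, 1], target_mean=6.0)
```

In this sketch a gene with a low error count would be a candidate parent of the class label, while genes with the highest error counts would be natural choices for the reference set.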

2.2 Support Vector Machines 

Support vector machines (SVMs) are powerful tools for data
classification (14, 15, 16). The development of the SVM was motivated, in the
simple case of two linearly separable classes, by the desire to choose an optimal
linear classifier out of an infinite number of linear classifiers that can separate
the data. This optimal classifier corresponds not only to a hyperplane that
separates the classes but also to a hyperplane that attempts to be as far away as
possible from all data points. If one imagines inserting the widest possible
corridor between data points (with data points belonging to one class on one
side of the corridor and data points belonging to the other class on the other
side), then the optimal hyperplane would correspond to the imaginary
line/plane/hyperplane running through the middle of this corridor.
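
The "widest corridor" picture corresponds to the standard hard-margin formulation for linearly separable data, sketched below. The notation is an assumption of this sketch rather than the text's own: x_k is the expression vector of training sample k, y_k in {-1, +1} is its class, and (w, b) define the separating hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_k\,(w \cdot x_k + b) \ge 1, \qquad k = 1, \ldots, N
% corridor (margin) width: 2 / \lVert w \rVert
% decision rule for a new sample x: \operatorname{sign}(w \cdot x + b)
```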

The SVM has a number of characteristics that make it particularly
appealing within the context of gene selection and the classification of gene
expression data, namely:

- The SVM is a multivariate classification algorithm that takes into
account each gene simultaneously in a weighted fashion during training, and

- It scales quadratically with the number of training samples, N, and not
with the number of features/genes, d.

In order to be computationally feasible, other methods first have to
reduce the number of dimensions (features/genes), and then classify the data in
the reduced space. A univariate feature selection process or filter ranks genes
according to how well each gene individually classifies the data (13, 17). The
overall SVM classification is then heavily dependent upon how successful the
univariate feature selection process is in pruning genes that have little
class-distinction information content. In contrast, the SVM provides an effective
mechanism for both classification and feature selection via the Recursive
Feature Elimination algorithm (18). This is a great advantage in gene
expression problems where d is much greater than N because the number of
features does not have to be reduced a priori.

Recursive Feature Elimination (RFE) is an SVM-based iterative
procedure that generates a nested sequence of gene subsets whereby the subset
obtained at iteration k+1 is contained in the subset obtained at iteration k. The
genes that are kept per iteration correspond to genes that have the largest weight
magnitudes, the rationale being that genes with large weight magnitudes carry
more information with respect to class discrimination than those genes with
small weight magnitudes.

Implementation of RFE algorithm: The rate of reduction in the number of genes
via the RFE algorithm has typically been geometric in nature (18, 19). For example,
in (18), 50% of the genes were removed per RFE iteration. However, as in (19),
we have taken a less aggressive pruning approach with respect to the number of
genes being removed per RFE iteration. In this work, the number of genes
removed was constant within blocks of intervals: from 8000 to 1000 genes,
1000 genes were removed per iteration; from 900 to 200 genes, 100 genes were
removed per iteration, etc.
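
The block-wise pruning schedule and the elimination loop can be sketched as follows. This is an illustration only: the original work was written in MATLAB, the function names are hypothetical, `train_weights` stands in for training a linear SVM and reading off its per-gene weights, and only the two blocks stated in the text are encoded (the blocks hidden behind "etc." are not guessed).

```python
def block_schedule(start, blocks):
    """Gene counts visited by RFE when the step size is constant per block.

    blocks: list of (lower_bound, step); while removing `step` genes keeps
    the count at or above lower_bound, `step` genes are removed.
    """
    counts = [start]
    n = start
    for lower, step in blocks:
        while n - step >= lower:
            n -= step
            counts.append(n)
    return counts

def rfe(features, train_weights, schedule):
    """Recursive feature elimination over a decreasing schedule of sizes.

    train_weights: callable(list_of_features) -> {feature: weight}, assumed
                   to come from a linear classifier trained on those features.
    Returns the nested subsets, largest first.
    """
    current = list(features)
    subsets = [current]
    for target in schedule[1:]:
        w = train_weights(current)
        # Keep the genes with the largest weight magnitudes.
        ranked = sorted(current, key=lambda f: abs(w[f]), reverse=True)
        current = ranked[:target]
        subsets.append(current)
    return subsets

# The two blocks given in the text: 8000 -> 1000 by 1000, then 900 -> 200 by 100.
counts = block_schedule(8000, [(1000, 1000), (200, 100)])

# Toy run with fixed, made-up weights in place of a trained SVM.
fixed = {"g1": 0.1, "g2": -2.0, "g3": 0.5, "g4": -0.05}
subsets = rfe(list(fixed), lambda feats: {f: fixed[f] for f in feats}, [4, 2, 1])
```

In a real run the weights would be recomputed by retraining on the surviving genes at every iteration, which is what makes the procedure recursive rather than a single ranking.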

Leave-one-out cross-validation (LOOCV) was used to assess the
performance of a linear SVM classifier. The LOOCV procedure forms N training
subsets from the training samples, where the i-th subset contains samples
1, ..., i-1, i+1, ..., N. The SVM classifier is then trained on the i-th subset
and tested on the withheld i-th sample. This process is repeated for each subset
and the LOOCV error is the overall number of misclassifications divided by N.
Note that the RFE algorithm was performed separately on each leave-one-out
fold; failure to do so induces a selection bias that yields LOOCV error rates
that are overly optimistic (20). If the benchmark for determining the number of
genes to use in training the SVM classifier is based only upon RFE iterations
with low LOOCV error, then one finds in practice many gene-subset sizes (e.g.
500, 100 or 50 genes) that satisfy this criterion. Using only the training set
LOOCV error, there is no obvious way to choose a priori which number of genes
should be used on the test set. Indeed, classifiers using different numbers of
genes will often lead to inconsistent predictions on the test set.
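
The point about repeating gene selection inside every leave-one-out fold can be made concrete with the sketch below. It is not the authors' code: a nearest-centroid classifier and a mean-difference gene filter stand in for the linear SVM and RFE, and all function names and toy data are hypothetical. What matters is the structure of `loocv_error`, where selection only ever sees the N-1 samples of the current fold.

```python
def mean(xs):
    return sum(xs) / len(xs)

def loocv_error(samples, labels, select_features, train, predict):
    """Leave-one-out error with feature selection redone inside every fold."""
    n = len(samples)
    errors = 0
    for i in range(n):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        feats = select_features(train_x, train_y)  # withheld sample not seen
        model = train(train_x, train_y, feats)
        if predict(model, samples[i], feats) != labels[i]:
            errors += 1
    return errors / n

def top_k_by_mean_difference(k):
    """Stand-in gene selector: largest absolute class-mean difference."""
    def select(train_x, train_y):
        def gap(f):
            pos = [x[f] for x, y in zip(train_x, train_y) if y == 1]
            neg = [x[f] for x, y in zip(train_x, train_y) if y == 0]
            return abs(mean(pos) - mean(neg))
        return sorted(range(len(train_x[0])), key=gap, reverse=True)[:k]
    return select

def train_centroids(train_x, train_y, feats):
    """Stand-in classifier: one centroid per class over the chosen genes."""
    return {c: [mean([x[f] for x, y in zip(train_x, train_y) if y == c])
                for f in feats]
            for c in set(train_y)}

def predict_centroid(model, x, feats):
    def dist(c):
        return sum((x[f] - m) ** 2 for f, m in zip(feats, model[c]))
    return min(model, key=dist)

# Toy data: gene 0 separates the classes, genes 1 and 2 are noise.
samples = [[5.0, 1.0, 0.3], [5.2, 0.9, 0.1], [4.9, 1.1, 0.2],
           [1.0, 1.0, 0.2], [1.1, 0.8, 0.3], [0.9, 1.2, 0.1]]
labels = [1, 1, 1, 0, 0, 0]
err = loocv_error(samples, labels, top_k_by_mean_difference(1),
                  train_centroids, predict_centroid)
```

Selecting genes once on all N samples and only then cross-validating would let the withheld sample influence the gene list, which is the selection bias the text warns about.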

Instead of choosing one subset of genes out of many as the definitive gene
subset to be used on the test set, we use many subsets in a weighted
voting scheme. The gene subsets used correspond to those sets with
low LOOCV error. To determine the weight attributed to each subset of genes,
metrics of classifier assessment other than LOOCV error were used. Once
LOOCV has been performed, the SVM classifier is then retrained on the entire
training set.

Let $G = \{G_1, \ldots, G_r\}$ denote the collection of gene subsets with low LOOCV
error, where $r$ is the number of gene subsets. The number of gene subsets, $r$,
used in this study was determined by inspection. However, one can easily use
LOOCV as a mechanism for determining $r$. Let $f_i(p_j)$ denote the prediction of
the $i$-th set, $G_i$, for the patient, $p_j$, in the test set. The final prediction for the
patient, $f(p_j)$, consists of a linear combination of the predictions made by
each set:

$$f(p_j) = \sum_{i=1}^{r} \alpha_i f_i(p_j)$$

where $\alpha_i$ is the weight attributed to each gene subset. In this work, $\alpha_i$ is
determined solely from the training set and consists of two components:

- A margin measure, computed as a median over the training patients of values
derived from $g_i(p_k)$, where $g_i(p_k)$ is the prediction made by the $i$-th set,
$G_i$, for the $k$-th patient, $p_k$, in the training set; this margin measure,
which is typically positive, is similar in spirit to the median margin metric used
in (18).

- The median number of support vectors across $r$ gene subsets.

The mathematical expression for $\alpha_i$ is a heuristic one: $\alpha_i = \alpha_{i1} + \alpha_{i2}$,
where $m_i$ is the median margin measure, $\alpha_{i1}$ is the normalized margin
measure, $NSV_i$ is the median number of support vectors obtained using $G_i$ as
the feature set in the SVM classifier, and $\alpha_{i2}$ is the normalized reciprocal,
$1/NSV_i$, of the number of support vector patients. The larger $m_i$ is, the greater
the influence $G_i$ has on the overall vote, since larger margins correspond to
better separation between classes and presumably better separation in the test
set. In contrast, the larger $NSV_i$ is, the lesser the influence $G_i$ has on the
overall vote, since separating hyperplanes determined by fewer support vectors
tend to have better generalization.

The SVM and RFE algorithms were written in MATLAB (21). The
particular SVM algorithm used was based upon the Lagrangian SVM
formulation of Mangasarian and Musicant (22). The RFE approach with the
voting scheme extension achieved the highest test set accuracy on the majority
of the tasks examined in this work. The best test accuracy was achieved for the
AML/ALL classification task, while the performance on the other tasks was
slightly better than the "majority-class" results, that is, the results obtained if
one were to always vote with the majority class. This is not surprising since the
AML/ALL class distinctions tend to "dominate" the gene expression behavior.
Since SVMs are not dependent upon an a priori and external feature/gene
reduction procedure and can efficiently fold feature selection into the
classification process, they will continue to perform well on tasks where the
class distinctions dominate the gene expression behavior. Non-linear SVMs
were trained on several of the classification tasks, but their generalization
performance on the test set, as expected, was far worse than that of the linear
SVM classifiers. Since the patients already sparsely populate a very
high-dimensional gene space, mapping to an even higher-dimensional feature
space via a nonlinear kernel will only exacerbate the problem of overfitting, a
condition already made worse by the disturbingly small size of the training set
relative to the number of genes and the large amount of experimental noise
associated with microarray-generated data in general.
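
The weighted voting scheme described in this section can be sketched as below. The exact normalization of the two weight components is not recoverable from the text, so this sketch assumes that "normalized" means divided by the sum over all gene subsets; that choice, the function names, the +1/-1 coding of predictions and the numbers are all assumptions for illustration.

```python
def subset_weights(median_margins, median_nsv):
    """alpha_i = alpha_i1 + alpha_i2 for each gene subset G_i.

    alpha_i1: margin measure m_i, normalized (assumed: divided by sum of m).
    alpha_i2: reciprocal 1/NSV_i, normalized (assumed: divided by its sum).
    """
    margin_total = sum(median_margins)
    recip = [1.0 / n for n in median_nsv]
    recip_total = sum(recip)
    return [m / margin_total + r / recip_total
            for m, r in zip(median_margins, recip)]

def vote(alphas, predictions):
    """f(p_j) = sum_i alpha_i * f_i(p_j); predictions are coded +1 or -1."""
    score = sum(a * f for a, f in zip(alphas, predictions))
    return (1 if score >= 0 else -1), score

# Three hypothetical gene subsets: larger margin and fewer support
# vectors both increase a subset's influence.
alphas = subset_weights(median_margins=[0.8, 0.5, 0.2], median_nsv=[20, 30, 60])
label, score = vote(alphas, [+1, -1, -1])
```

Here the first subset outvotes the other two combined because it has both the largest margin measure and the fewest support vectors.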

2.3 Class Prediction by Linear Discriminant Analysis

Discriminant analysis is a widely used statistical analysis tool (23). It can be
applied to classification problems where a training set of samples, depending on
some set of feature variables, is available. The idea is to find a linear or
non-linear function of the feature variables such that the value of the function
differs significantly between different classes. This function is the so-called
discriminant function. Once the discriminant function has been determined
using the training set, we can predict the class that a new sample most likely
belongs to.

Preprocessing: Not all of the original data were used in our analysis of the infant
leukemia dataset. We eliminated all control genes (those with accession ID
starting with the AFFX prefix) and those genes with 'Absent' calls for all
142 samples. With these genes removed from the original 12625, we were left
with 8414 genes. In addition, a natural log transformation was performed on
the 8414 x 142 matrix of gene expression values prior to further analysis.

Selection of Significant Discriminating Genes for Binary Classifications: We
assumed that the discriminating genes will be those with the most statistically
significant difference between the two classes in a given binary classification
task. We evaluated each gene by checking if its expression value differed
significantly between the two classes. This was done using the two-sample
t-test. The larger the absolute value of the t-test statistic T, the greater the
confidence that there is a difference between the expression values of the two
classes. The significance of the difference can be measured via the
corresponding p-value, which provides a straightforward means of ranking the
genes in order of importance.
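
The gene ranking step can be sketched as follows. The text does not state which variant of the two-sample t-test was used; this sketch assumes the pooled-variance (Student) form, for which every gene has the same degrees of freedom, so ranking by the absolute value of the statistic gives the same order as ranking by p-value. The function names and the toy expression values are hypothetical.

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (an assumed variant)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    sp2 = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

def rank_genes(expr, labels):
    """expr: {gene: [value per sample]}; labels: 0 or 1 per sample.
    Returns genes ordered from most to least discriminating."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(pooled_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)

# Toy log-expression values for three made-up genes and six samples.
expr = {
    "geneA": [5.0, 5.2, 4.8, 1.0, 1.2, 0.8],
    "geneB": [2.0, 2.5, 1.5, 1.8, 2.4, 1.6],
    "geneC": [3.0, 3.4, 2.6, 2.0, 2.4, 1.6],
}
labels = [1, 1, 1, 0, 0, 0]
ranking = rank_genes(expr, labels)
```

A gene whose values are constant within both classes would give a zero pooled variance; a production version would need to guard against that case.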

Class Prediction: Once the genes have been ranked using the p-value, we need
to select a subset, the top n genes, as our discriminant variables. The
expression values of these genes in the training set are used to determine a
linear discriminant function, which discriminates between the two classes and
also defines a trained classifier for making the class predictions for each sample
in the test set. The question is how to determine the optimal value for n. n must
be less than the sample size of the training set; otherwise the covariance matrix
of the samples in the training set will be singular and the discriminant function
cannot be determined. Also, if n is too large, the discriminant function may be
overfitted to the data in the training set, which may lead to more
misclassifications when it is used to make predictions in the test set. On the
other hand, if n is too small, then the information contained in the feature set
may not be sufficient for making accurate predictions. In practice, different
prediction outcomes result when different numbers n of prediction genes are
used in the classifier. To determine the class of a given sample from the test set,
we have therefore chosen to use a simple voting scheme. We make a series of
predictions with the number n of prediction genes varying from 1/3 to 2/3 of the
sample size of the training set. (For example, if the number of samples in the
training set was 85, we computed predictions for the given sample from the test
set using n = 28, 29, 30, ..., 56.) The dominant class predicted is then taken as
the final prediction result for the sample. Overall, the results of our discriminant
analysis for classification tasks were not as good as those of the other
multivariate methods (fuzzy logic, Bayesian, SVM) applied to these problems.
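
The simple voting scheme over the number of prediction genes can be sketched as below. The classifier itself is abstracted away: `predict_with_n` is a hypothetical callable standing in for "train the linear discriminant on the top n genes and classify the test sample", and the stand-in used in the example is made up. Integer division is assumed for the 1/3 and 2/3 bounds, which reproduces the range 28 to 56 given for a training set of 85.

```python
from collections import Counter

def vote_over_n(n_train, predict_with_n):
    """Predict with n = n_train/3 ... 2*n_train/3 genes and take the
    dominant class (ties go to the class encountered first)."""
    ns = range(n_train // 3, (2 * n_train) // 3 + 1)
    votes = Counter(predict_with_n(n) for n in ns)
    return votes.most_common(1)[0][0], dict(votes)

# Stand-in predictor whose answer flips once enough genes are used.
winner, votes = vote_over_n(85, lambda n: "Fail" if n < 35 else "Remission")
```

With 85 training samples this makes 29 predictions, one for each n from 28 through 56.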

2.4 Fuzzy Interference Classification Methodology 

30 

Traditional classification methods are based on the theory of crisp sets, 
where an element is either a member of a particular set or not. However many 
objects encountered in the real world do not fall into precisely defined 
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membership criteria. Alternative forms of data classification, which allows for 
continuous membership gradations, have been investigated and introduced 
fuzzy logic theory (24). 

In many applications, it is easier to produce a linguistic description of a
system than a complex mathematical model. The advantage of fuzzy logic in
these situations is its ability to describe systems linguistically through rule
statements (25). Expert human knowledge can then be formulated in a
systematic manner. For example, for a gene regulatory model, one rule
statement might be: "If the activator A is high and the repressor B is low, then
the target C would be high" (26).

A Fuzzy Inference System (FIS) contains four components: fuzzy rules,
a fuzzifier, an inference engine, and a "defuzzifier" (27). The fuzzy rules,
consisting of a collection of IF-THEN rules, define the behavior of the inference
engine. The membership functions μF(x) provide a measure of the degree of
similarity of elements to the fuzzy subset.
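
To make the notions of membership function and IF-THEN rule concrete, the rule quoted above ("If the activator A is high and the repressor B is low, then the target C would be high") can be sketched in Python. The piecewise-linear membership functions, the [0, 1] input scale, and the use of the minimum for the fuzzy AND are illustrative assumptions; they are common textbook choices and are not taken from this work.

```python
def high(x, lo=0.0, hi=1.0):
    """Membership grade of x in the fuzzy set 'high'.

    Assumed shape: 0 at or below lo, 1 at or above hi, linear in between.
    """
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def low(x, lo=0.0, hi=1.0):
    """Membership grade of x in 'low', taken as the complement of 'high'."""
    return 1.0 - high(x, lo, hi)


def target_c_is_high(activator_a, repressor_b):
    """Firing strength of: IF A is high AND B is low THEN C is high.

    The fuzzy AND is modeled with min(), one common (assumed) choice.
    """
    return min(high(activator_a), low(repressor_b))


strength = target_c_is_high(0.9, 0.2)
```

With A = 0.9 and B = 0.2 the rule fires with strength min(0.9, 0.8) = 0.8, a graded answer rather than the yes/no result a crisp set would give.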

In fuzzy classification, the training algorithm adapts the fuzzy rules and
membership functions so that the behavior of the inference engine represents
the sample data sets. The most widely used adaptive fuzzy approach is the
neuro-fuzzy technique, in which learning algorithms developed for neural nets
are modified so that they can also train a fuzzy logic system (28).

Preprocessing: The infant dataset we used consists of gene expression levels for
12625 probesets on the Affymetrix U95Av2 chip, including 67 control genes,
measured for 142 patients. The Affymetrix Microarray Suite (MAS) 5.0 assigns
a "Present", "Marginal", or "Absent" call to the computed signal reported for
each probeset [Affymetrix 2001]. Because of strong observed variations in the
range of gene expression values across different experiments, it is necessary to
preprocess the data prior to further analysis.

In the infant dataset, 17% of all the labels are "Present", 81% are "Marginal",
and 2% are "Absent". We prefer not to eliminate too many probesets at the
outset, so we choose a loose rule to filter the probesets. We assume that
"reliable probesets" satisfy the following criteria:
1. They are not control genes;
2. For a given probeset, at least one label (across all patients) should be
"Present".

Under these criteria, 8446 probesets survive. 

For a given patient, the distribution of gene expression values is not
uniform. It grows exponentially. After filtering, we therefore perform a
base-10 logarithmic transformation of the gene expression data. This
logarithmic transformation scales the data to assist in visualizations,
remedies right-skewed distributions and makes error components additive
(29). It also removes systematic variations in experiments. Previously, in
our analysis of the MIT leukemia dataset (30), we found that
logarithmic transformation of the gene expression data improves fuzzy and
neuro-fuzzy classification accuracies compared to untransformed data.
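
The filtering and transformation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the single-letter call codes ("P", "M", "A"), the convention that control probesets carry an "AFFX" prefix, and the floor applied before taking the logarithm (to avoid non-positive signals) are all assumptions introduced here and are not specified in the text.

```python
import math


def reliable_probesets(calls, is_control):
    """Keep probesets that are not controls and have at least one 'P' call.

    calls: dict mapping probeset ID -> list of detection calls, one per
    patient, coded as 'P' (Present), 'M' (Marginal) or 'A' (Absent).
    is_control: callable returning True for control probesets.
    """
    return [p for p, c in calls.items() if not is_control(p) and "P" in c]


def log10_transform(signals, floor=1.0):
    """Base-10 log of each signal; values below `floor` are raised to it.

    The floor is an assumption added so that the logarithm is defined.
    """
    return [math.log10(max(s, floor)) for s in signals]


# Toy data: one control probeset, one never-Present probeset, one kept.
toy_calls = {
    "AFFX-example_at": ["P", "P"],
    "100_at": ["A", "M"],
    "200_at": ["A", "P"],
}
kept = reliable_probesets(toy_calls, lambda p: p.startswith("AFFX"))
logged = log10_transform([100.0, 0.5])
```

Only "200_at" survives the filter in this toy example, and the signals 100.0 and 0.5 map to 2.0 and 0.0 (the latter because of the assumed floor of 1.0).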

Feature Selection: Even after filtering, the dimension of our dataset, 8446, is 
still too large for a classification problem. It is well known that increasing the 
number of features beyond a value of the order of the number of samples can 
actually degrade classification performance rather than improving it (31). In
addition, reducing the dimensionality of the feature space is necessary to 
decrease the cost and time of classification (32). Here we use rank ordering 
statistics for feature selection. 

Our method is as follows. For a given classification task, we rank the genes 
according to the average signal intensity across the patients in each class. We 
then calculate the difference in rank position between the two classes for each 
gene and order these genes with increasing value of the rank difference. The 
larger the absolute difference in rank for a gene, the more important that gene is. 
Rank ordering identifies the genes with the most "discriminating power" for 
distinguishing the two classes. Finally, we select the top 100 genes, 
corresponding to the 100 largest rank ordering differences, as our discriminating 
genes, for input to the fuzzy classifier.
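
The rank-ordering selection described above can be sketched as follows. This is a minimal Python illustration, not the code used in this work; the function names, the data layout (one list of gene values per patient), and the handling of ties (left to the stable sort) are assumptions.

```python
def rank_positions(means):
    """Rank position of each gene (0 = highest mean signal)."""
    order = sorted(range(len(means)), key=lambda g: means[g], reverse=True)
    ranks = [0] * len(means)
    for position, gene in enumerate(order):
        ranks[gene] = position
    return ranks


def top_rank_difference_genes(class_a, class_b, top):
    """Indices of the `top` genes with the largest absolute rank difference.

    class_a, class_b: lists of patients; each patient is a list of
    (log-transformed) signal intensities, one per gene.
    """
    n_genes = len(class_a[0])
    mean_a = [sum(p[g] for p in class_a) / len(class_a) for g in range(n_genes)]
    mean_b = [sum(p[g] for p in class_b) / len(class_b) for g in range(n_genes)]
    rank_a, rank_b = rank_positions(mean_a), rank_positions(mean_b)
    diff = [abs(rank_a[g] - rank_b[g]) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: diff[g], reverse=True)[:top]


# Toy data: genes 0 and 2 swap rank between the classes, genes 1 and 3 do not.
selected = top_rank_difference_genes([[10, 5, 1, 3]], [[1, 5, 10, 3]], top=2)
```

In the method above `top` would be 100; the toy call uses 2 and selects genes 0 and 2, whose rank positions differ by 3 between the two classes.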

Classification Approach: The 100 "top" genes determined in the feature
selection step are in reality an upper bound for the optimal number, k*, of
discriminating genes. We note, too, that k* will vary according to classification
task because the training model will be different for each task. Here, we have
used Leave One Out Cross Validation (LOOCV) to determine k* for each task
(33).

We followed standard LOOCV methodology to compute the prediction
error of our classification method. This procedure iterated k from 1 to
100 in the dataset, where k is the number of top discriminating genes
used to train our model. Within each iteration, we iteratively removed a
single patient from the data set and trained the classification procedure
using k discriminating genes on the rest of the patients. We then applied
the trained classifier to the held-out patient and compared the predicted
class to the true class. The number of prediction errors and the resulting
LOOCV error were recorded for each k. The optimal solution, k*,
corresponds to the value of k with the smallest LOOCV error.

With the number of genes now fixed at k*, we used the labeled training dataset
to generate a Sugeno-type fuzzy inference system using the Fuzzy Logic
Toolbox in Matlab (34). This uses the fuzzy c-means technique to partition each
data point to a degree specified by a membership grade, and subtractive
clustering to initialize the iterative optimization. For comparison, we also
implemented an adaptive neuro-fuzzy inference system (ANFIS) to tune the
parameters of the fuzzy membership functions based on knowledge learned
from the modeling data. Training an ANFIS is an optimization task with the
goal of finding a set of weights that minimizes an error measure. In our tests, we
found that this procedure increased the computational burden significantly, but
provided only marginal performance improvement. Once the classifier is
trained, it can be used to predict the class type of the test dataset. For a given
new patient, the inputs to the FIS are signal intensities of the top k* genes. The
output of the FIS is the classification result for this patient. The ideal output for
the ALL class is 1 and the ideal output for the AML class is -1. The larger the
distance between the actual prediction and 1 or -1, the weaker the prediction.

Fuzzy methods share a number of features in common with neural networks and
with probabilistic methods (such as Bayesian approaches); however, they have
several unique advantages, which suggest interesting avenues for future
research. In particular, their ability to naturally incorporate non-numeric data,
such as the knowledge of an expert, into a model opens the possibility of the
use of expert data priors such as clinical assessments within the classification
system. Similarly, incomplete knowledge about gene interrelationships may be
incorporated into gene-expression-based models of regulatory networks.
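
The LOOCV search for the optimal number of genes k* can be sketched as follows. This is a minimal Python illustration under explicit assumptions: a simple nearest-centroid classifier (fit_centroids, predict_centroid) is substituted for the fuzzy inference system, the gene columns are assumed to be already sorted by rank-ordering importance, and ties in the error count are broken in favor of the smaller k. None of these choices is taken from this work.

```python
def fit_centroids(rows, labels):
    """Stand-in classifier: mean expression vector (centroid) per class."""
    sums, counts = {}, {}
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(model, row):
    """Predict the class whose centroid is closest (squared distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(row, model[lab])))


def loocv_errors(rows, labels, k):
    """Number of held-out patients misclassified when using the top k genes."""
    errors = 0
    for i in range(len(rows)):
        train_rows = [r[:k] for j, r in enumerate(rows) if j != i]
        train_labels = [lab for j, lab in enumerate(labels) if j != i]
        model = fit_centroids(train_rows, train_labels)
        if predict_centroid(model, rows[i][:k]) != labels[i]:
            errors += 1
    return errors


def choose_k_star(rows, labels, k_max):
    """Return (k*, errors per k); k* minimizes the LOOCV error count."""
    errs = {k: loocv_errors(rows, labels, k) for k in range(1, k_max + 1)}
    k_star = min(errs, key=lambda k: (errs[k], k))
    return k_star, errs


# Toy data: gene 0 is uninformative, gene 1 separates the classes (+1 / -1).
X = [[0, 5], [1, 5], [0, 5], [1, -5], [0, -5], [1, -5]]
y = [1, 1, 1, -1, -1, -1]
k_star, errs = choose_k_star(X, y, k_max=2)
```

On the toy data, using only the first gene misclassifies some held-out patients, while using both genes gives no LOOCV errors, so k* = 2 is selected.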

3. Methods for evaluating the performance of class predictors

Four class predictors, based on the techniques of Bayesian Networks, Support
Vector Machines (SVM), Fuzzy Inference and Discriminant Analysis, as
described in the previous section, have been applied to thirteen supervised
binary classification tasks using gene expression microarray data for the cohort
of infant leukemia patients studied in the present work. In this section we
describe the statistical methods we have used for evaluating the performance of
the four class predictors based on their prediction results with respect to the
thirteen tasks.

In any binary classification task, there are four possible prediction outcomes
characterized as true-positive (TP), false-positive (FP), true-negative (TN) and
false-negative (FN). In the former two instances, a sample is, respectively,
correctly or incorrectly classified into Class A, while the latter two instances
correspond to classification into Not-Class A. Consequently, the performance of
a class predictor can always be completely summarized in terms of a 2x2 matrix
as shown in Table 48.

Table 48. Prediction Outcome Probabilities of a Class Predictor

Original Classes | Predicted Class A | Predicted Not-Class A | Row Sum
Class A | TP = true-positive probability | FN = false-negative probability | 1
Not-Class A | FP = false-positive probability | TN = true-negative probability | 1

Note that because each row sums to 1, only one quantity is required from each
row in order to determine the entire matrix. In other words, there are only two
independent quantities in Table 48. These can be regarded as evaluating the
different aspects of the class predictor's performance. Improving a class
predictor's performance in TP may lower its TN, while its TN may be improved
at the cost of reducing its TP. In order to evaluate the overall performance of
a class predictor, therefore, a measure that combines the two independent
quantities is needed.

We considered two such overall measures: the success rate r and the odds ratio
OR. The success rate is defined as the probability of correct prediction. This is
just a weighted average of TP and TN:

r = w1 TP + w2 TN, [1]

where w1 = actual proportion of Class A in the test set, and w2 = 1 - w1. TP and
TN are intrinsic values associated with a given predictor, and are unknown;
therefore r is also unknown and must be estimated. A commonly used point
estimate of r, which we have utilized here, is the ratio of the number of correct
predictions to the total number of predictions. We have also computed the 95%
confidence intervals of r (35). Finally, we have performed a significance test to
evaluate the extent to which the performance of a predictor differs from what
would have been obtained by chance alone. This is equivalent to testing the
statistical hypotheses

H0: r = 0.5 versus Ha: r > 0.5. [2]

If the p-value (35) of the test is no larger than a given significance level α (here,
we have set α = 0.05 and α = 0.01), then we reject the null hypothesis H0 and
conclude that the difference is significant at level α. The p-value is closely
related to the success rate: the larger the success rate, the smaller the expected
p-value. Thus, either the success rate or the p-value can be used to measure the
performance of a predictor. For each of four class predictors, and with respect to
each of thirteen tasks, we have computed the point estimate and confidence
interval of r. These are presented in Table 48, along with the p-value
corresponding to the statistical test of hypotheses [2].
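
The success-rate calculations can be illustrated with a short Python sketch. The point estimate follows the definition given above. The Wilson score interval and the exact one-sided binomial tail used below are assumptions chosen for illustration; the specific interval and test procedure of reference (35) are not reproduced here.

```python
import math


def success_rate(correct, total):
    """Point estimate of r: correct predictions over all predictions."""
    return correct / total


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% confidence interval for r (Wilson score interval).

    Assumed method; z = 1.96 corresponds to a two-sided 95% level.
    """
    p_hat = correct / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total
                         + z * z / (4 * total * total)) / denom
    return center - half, center + half


def p_value_vs_chance(correct, total):
    """Exact binomial tail P(X >= correct) under H0: r = 0.5."""
    tail = sum(math.comb(total, k) for k in range(correct, total + 1))
    return tail / 2 ** total


r_hat = success_rate(8, 10)
ci_low, ci_high = wilson_interval(8, 10)
p_val = p_value_vs_chance(8, 10)
```

For 8 correct predictions out of 10, the point estimate is 0.8 and the exact one-sided p-value is 56/1024, about 0.055, so H0 would not be rejected at α = 0.05 for so small a test set.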

The second overall measure that we utilized is the odds ratio (OR). Since a good
class predictor should simultaneously satisfy

TP > FN and FP < TN, [3]

or equivalently,

TP/FN > 1 and FP/TN < 1, [4]

this implies that the ratio of the left-hand sides of the inequalities in [4], i.e.,

OR = (TP/FN) / (FP/TN), [5]

should be large (at least larger than 1). Hence this ratio, known as the odds
ratio (29), can be utilized as an overall measure for evaluating the class
predictor's performance. For each of the four class predictors and each of the
thirteen tasks, the estimated value of OR and its 95% exact confidence interval
(36) have been calculated using the SAS package (37), and the results
are listed in Table 49.

Above, we observed that the expected values for the TP and FP of a good class
predictor should satisfy TP > FP or TP/FP > 1, which is mathematically
equivalent to OR > 1. This suggests that the performance of a classifier can
alternatively be evaluated by testing the following hypotheses:

H0: TP ≤ FP vs. Ha: TP > FP, [6]

or equivalently

H0: OR ≤ 1 vs. Ha: OR > 1. [7]

Hence the p-value of the test also serves as a good measure for evaluating the
performance of the class predictor. A uniformly most powerful unbiased test,
known as Fisher's exact test (38), has been used to test the hypotheses [7] and
the p-values of the test are given in Table 49.
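
The odds ratio of equation [5] and the one-sided Fisher's exact test of hypotheses [7] can be illustrated with the following Python sketch. It works on observed counts rather than probabilities, computes the p-value directly from the hypergeometric distribution, and is only an illustration; the results reported here were obtained with the SAS package, and the function names and the handling of zero cells below are assumptions.

```python
import math


def odds_ratio(tp, fn, fp, tn):
    """Sample odds ratio (TP/FN) / (FP/TN) from the 2x2 table of counts."""
    if fn * fp == 0:
        return math.inf  # assumed convention when a denominator cell is 0
    return (tp * tn) / (fn * fp)


def fisher_exact_one_sided(tp, fn, fp, tn):
    """P-value for Ha: OR > 1 with all table margins held fixed.

    Sums the hypergeometric probabilities of tables whose true-positive
    count is at least as large as the one observed.
    """
    row_a, row_not_a = tp + fn, fp + tn     # actual class sizes
    col_a = tp + fp                         # number predicted as Class A
    total = row_a + row_not_a
    upper = min(row_a, col_a)
    tail = sum(math.comb(row_a, x) * math.comb(row_not_a, col_a - x)
               for x in range(tp, upper + 1))
    return tail / math.comb(total, col_a)


or_hat = odds_ratio(tp=8, fn=2, fp=1, tn=5)
p_fisher = fisher_exact_one_sided(tp=8, fn=2, fp=1, tn=5)
```

For this toy table the estimated odds ratio is 20 and the one-sided p-value is 280/11440, about 0.024, which would be significant at α = 0.05 but not at α = 0.01.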

From Tables 48 and 49 it is evident that all of the four class predictors
performed well on Tasks 1 and 3. The statistical test for hypotheses [2] rejects
the null hypothesis H0 and we may conclude that the predictions made by the
four class predictors on these tasks are significantly better than those made by
chance, at level α = 0.01. Fisher's exact test yields similar results, except
that for two of the predictors (fuzzy inference and discriminant analysis), the
significance level for Task 3 predictions is α = 0.05.

4. Unsupervised methods - Clustering methodology

Three types of methodologies were used in the clustering analysis,
namely agglomerative hierarchical clustering, Principal Component Analysis
and a force-directed clustering algorithm coupled with the Vxinsight
visualization tool.

4.1 Agglomerative Hierarchical clustering

The grouping together, or clustering, of genes with similar patterns of
expression is based on the mathematical measure of their similarity, e.g. the
Euclidean distance, angle or dot products of the two n-dimensional vectors of a
series of n measurements. Biological interpretation of DNA microarray
hybridization gene expression data has utilized clustering to re-order genes, and
conversely samples, into groups which reflect inherent biological similarity.
Clustering methods can be divided into two classes, supervised and
unsupervised. In supervised clustering, vectors are classified with respect to
known reference vectors. Unsupervised clustering uses no defined vectors. With
a diverse dataset of 126 infant leukemia patients and our intent to discover
unique patterns within, we chose to use an unsupervised clustering approach. In
addition, combining the ordered list of genes and patients with a graphical
presentation of each data point using relative value-color, termed a "heat map",
aids the viewer in an intuitive manner. Several computer software programs
allow one to cluster significant samples and genes and create graphical output
(Cluster, Genespring, GeneCluster).

We have applied the Eisen (39) Cluster algorithm utilizing pairwise
average-linkage cluster analysis to gene expression data from Affymetrix
U95Av2 arrays. Genes were selected for this analysis if the Affymetrix
Microarray Analysis Software v. 5.0 assigned a "Present" call in at least 1 of
the 126 patient samples. The resulting 8,358 genes were z-scored across patients and the
standard deviation determined. The clustering algorithm of genes is as follows:
the distance between two genes is defined as 1-r where r is the correlation
coefficient between the 252 values of the two genes across samples. Two genes
with the closest distance are first merged into a super-gene and connected by
branches with length representing their distance, and are deleted from future
merging. The expression level of the newly formed super-gene is the average of
standardized expression levels of the two genes (average-linked) across
samples. Then the next super-gene with the smallest distance is chosen to
merge and the process repeated 8,352 times to merge all 8,353 genes.
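
The merging procedure described above can be sketched in Python. This is a small, quadratic-time illustration of the "super-gene" scheme (distance 1 - r, merged profile equal to the average of the two merged profiles) and is not the Cluster program itself; the function names, the dictionary-based data layout, and the treatment of zero-variance profiles are assumptions.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient r between two expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su * sv == 0:
        return 0.0  # assumed convention for constant profiles
    return cov / (su * sv)


def cluster_genes(profiles):
    """Repeatedly merge the two closest nodes (distance = 1 - r).

    profiles: dict mapping gene name -> list of standardized values.
    Returns the list of merges as (node_a, node_b, distance).
    """
    nodes = dict(profiles)
    merges = []
    while len(nodes) > 1:
        names = list(nodes)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = 1.0 - pearson(nodes[names[i]], nodes[names[j]])
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        d, a, b = best
        # The super-gene profile is the average of the two merged profiles.
        merged = [(x + y) / 2 for x, y in zip(nodes[a], nodes[b])]
        del nodes[a], nodes[b]
        nodes[(a, b)] = merged
        merges.append((a, b, d))
    return merges


toy = {
    "g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1], "g4": [8, 6, 4, 2.5],
}
merges = cluster_genes(toy)
```

Merging N genes takes N - 1 steps; in the toy example the rising profiles g1 and g2 merge first, the falling profiles g3 and g4 merge second, and the two super-genes are joined last at a distance close to 2.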

4.2 Principal Component Analysis

Principal component analysis (PCA) is a well-known and convenient
method for performing unsupervised clustering of high-dimensional data.
Closely related to the Singular Value Decomposition (SVD), PCA is an
unsupervised data analysis technique whereby the most variance is captured in
the least number of coordinates (40-42). It can serve to reduce the
dimensionality of the data while also providing significant noise reduction.
PCA can also be applied to gene-expression data obtained from microarray
experiments. When gene expressions are available from a large number of
genes and from numerous samples, then the noise suppression and dimension
reduction properties of PCA can greatly facilitate and simplify the examination
and interpretation of the data. In any microarray experiment, the expression
profiles of many genes are monitored simultaneously. Because many genes are
often up or down regulated in similar patterns in the cells, these responses are
correlated. PCA can identify the uncorrelated or independent sources of
variation in the gene expression data from multiple samples. Since random
noise tends to be uncorrelated with the signal, PCA does an effective job at
separating the signal from the noise in the data.

If the gene expression values from each microarray are written as row
vectors, then the entire data set from multiple microarray samples can be
represented by a data matrix whose rows represent the gene expressions from
each microarray chip. PCA can greatly reduce the complexity and
dimensionality of the data by factor analyzing the data matrix into the product
of two much smaller matrices. The two smaller matrices are known as scores
and loading vectors (or eigenvectors). The decomposition is often achieved with
a method known as singular value decomposition (SVD). PCA has the unique
property that the decomposition is performed such that the rows of the score
matrix are orthogonal and the columns of the eigenvector matrix are also
orthogonal. Although there is a strict mathematical definition of orthogonal,
orthogonal vectors are simply independent and uncorrelated with one another.
Therefore, these vectors represent unique sources of variation in the microarray
data. Another property of the eigenvectors is that they are calculated such that
the first eigenvector represents the largest source of variance in the data, the
second represents the next largest unique source of variance in the data, and so
on. Since we generally expect the signal in the data to be larger than the noise
and since random noise is approximately orthogonal to the signal, PCA has the
ability to separate the noise from the signal that we are interested in. By ignoring
the eigenvectors with low variance, we can observe the portion of the data that
contains primarily signal.
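
As an illustration of how an eigenvector and its scores relate to the data matrix, the following Python sketch extracts only the first principal component. It uses power iteration on the mean-centered data rather than a full SVD; this substitution, the function name, the all-ones starting vector, and the fixed iteration count are assumptions made to keep the example dependency-free, and the sketch assumes the data are not constant.

```python
import math


def first_principal_component(rows, iterations=100):
    """Leading eigenvector (loading) and scores of mean-centered data.

    rows: list of samples (microarrays), each a list of gene expressions.
    Power iteration repeatedly applies X^T X to a trial vector; it
    converges to the direction of largest variance provided the starting
    vector is not orthogonal to that direction.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    X = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0] * p
    for _ in range(iterations):
        xv = [sum(x[j] * v[j] for j in range(p)) for x in X]           # X v
        w = [sum(X[i][j] * xv[i] for i in range(n)) for j in range(p)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(x[j] * v[j] for j in range(p)) for x in X]
    return v, scores


# Toy data: four samples lying exactly on the line gene2 = 2 * gene1.
loading, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
```

For the toy data the loading vector is proportional to (1, 2), and the four scores give the position of each sample along that single direction, a one-number summary of two correlated genes.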

The scores matrix represents the amounts of each eigenvector in each
sample that are required to reproduce the data matrix. When we eliminate the
noisier eigenvectors we also eliminate their associated scores. The scores
represent a compressed form of the data matrix in the new coordinate system of
the eigenvectors. Since scores are derived from the expression of many genes
and many samples, they have much higher signal-to-noise ratios than the
individual gene expressions upon which they are based. A plot of the scores for
each microarray for each eigenvector then is a new compressed form of the
gene expression data for all samples. 2D plots of one set of scores vs. another
for two selected eigenvectors allow us an examination of the microarray data in
the compressed PCA space so that we can readily observe clusters in expression
data. 3D plots are also possible when the scores from three selected
eigenvectors are displayed. Statistical metrics can be used to identify groupings
or clusters in the data in 2, 3, or higher dimensions that cannot be readily
viewed graphically. All the statistical supervised and unsupervised clustering
methods that are based on individual genes or groups of genes can be applied to
the scores representation of the data.

The first three Principal Components partition the infant cohort into two
different groups. Interestingly, these groups display a weak correlation with the
infant ALL/AML lineage membership (and none with the MLL cytogenetics),
although the correlation is not seen until the second PC. This indicates,
according to the theory behind PCA, that the ALL/AML distinction is not the
driving force behind the representation of the patient cohort. The first (and
most important) Principal Component, on the other hand, does not reveal any
obvious clusters. Upon further analysis, however, we did find an additional
interesting group correlated with the first Principal Component. This group was
discovered by a force-directed graph layout algorithm and the Vxinsight®
visualization program (43, 44).

4.3 Vxinsight and the force-directed clustering algorithm

This clustering algorithm places genes into clusters such that the sum of two
opposing forces is minimized. One of these forces is repulsive and pushes pairs
of genes away from each other as a function of the density of genes in the local
area. The other force pulls pairs of similar genes together based on their degree
of similarity. The clustering algorithm stops when these forces are in
equilibrium. Every gene has some correlation with every other gene; however,
most of these are not strong correlations and may only reflect random
fluctuations. By using only the top few genes most similar to a particular gene
as it is placed into a cluster we obtain two benefits. First, the algorithm runs
much faster. Second, as the number of similar genes is reduced, the average
influence of the other, mostly uncorrelated genes diminishes. This change
allows the formation of clusters even when the signals are quite weak. However,
when too few genes are used in the process, the clusters break up into tiny
random islands, so selecting this parameter is an iterative process. One trades
off confidence in the reliability of the cluster against refinement into
subclusters that may suggest biologically important hypotheses. These clusters are
only interpreted as suggestions, and require further laboratory and literature
work before we assign them any biological importance. However, without
accepting this trade-off, it may be impossible to uncover any suggestive
structure in the collected data. For example, we clustered using the twenty other
genes most strongly similar to each gene. When we re-clustered using only the top
ten most strongly similar genes, the observed clusters broke up into
smaller groups. We carefully analyzed these for biological support and believe
that they may be suggestive of weak, but important groupings in our
experimental data.

Vxinsight was employed to identify clusters of patients with
similar gene expression patterns, and then to identify which genes strongly
contributed to the separations. That process created lists of genes, which when
combined with public databases and research experience, suggest possible
biological significance for those clusters. The array expression data were
clustered by rows (similar genes clustered together), and by columns (patients
with similar gene expression clustered together). In both cases Pearson's R was
used to estimate the similarities. These similarities were used together with a
force-directed, two-dimensional clustering algorithm (43, 44) to produce maps
showing clusters of genes and patients. Different maps were generated by using
the top twenty, top ten and top five strongest correlations for each gene (using
more similarity links between genes generates more stable clusters, while using
fewer links leads to finer, if less stable, divisions). This methodology has been
useful in inferring functions of uncharacterized genes clustered near other genes
with known functions (45, 46), and did contribute to our analysis here, too.
However, patients were the main focus of this study and most of the analysis
revolved around the map of patient clusters. Analysis of variance (ANOVA)
was used to determine which genes had the strongest differences between pairs
of patient clusters. These gene lists were sorted into decreasing order based on
the resulting F-scores, and were presented in an HTML format with links to the
associated OMIM pages, which were manually examined to hypothesize
biological differences between the clusters.

We also investigated the stability of those gene lists using statistical
bootstraps (47, 48). For each pair of clusters we computed 1000 random
bootstrap cases (resampling with replacement from the observed expressions)
and computed the resulting ordered lists of genes using the same ANOVA
method as before. The average order in the set of bootstrapped gene lists was
computed for all genes, and reported as an indication of rank order stability (the
percentile from the bootstraps estimates a p-value for observing a gene at or
above the list order observed using the original experimental values).

Because the force-directed placement algorithm used by Vxinsight has a
stochastic element (random initial starting conditions), we used massively
parallel computers to calculate hundreds of reclusterings with different seeds for
the random number generator. We compared pairs of ordinations by counting,
for every gene, the number of common neighbors found in each ordination.
Typically, we looked in a region containing the 20 nearest neighbors around
each gene, in which case one could find (around each gene) a minimum of 0
common neighbors in the two ordinations, or a maximum of 20 common
neighbors. By summing across every one of the genes an overall comparison of
similarity of the two ordinations can be computed. We computed all pairwise
comparisons between the randomly restarted ordinations and found the
ordination that had the largest count of similar neighbors across the totality of
all the comparisons. Note that this corresponds to finding the ordination whose
comparison with all the others has minimal entropy, and in a general sense
represents the most central ordination (MCO) of the entire set. It is possible to
use these comparison counts (or entropies) as similarity measures to compute
another round of ordinations. The clusters from this recursive use of the
ordination algorithm are generally smaller, much tighter, and are generally more
stable with respect to random starting conditions than any single ordination. We
used all of these methods during exploratory data analysis to develop intuition
about the data.
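
The common-neighbor comparison of ordinations can be sketched in Python as follows. This is a minimal illustration, not the code run on the parallel computers: each ordination is assumed to be a dictionary mapping a gene name to its two-dimensional map coordinates, neighbors are found by Euclidean distance with ties broken by gene name, and all function names are hypothetical.

```python
import math


def nearest_neighbors(layout, k):
    """For each gene, the set of its k nearest genes in one ordination."""
    neighbors = {}
    for gene, position in layout.items():
        others = sorted((g for g in layout if g != gene),
                        key=lambda g: (math.dist(position, layout[g]), g))
        neighbors[gene] = set(others[:k])
    return neighbors


def common_neighbor_count(layout_a, layout_b, k):
    """Sum over all genes of the neighbors shared by the two ordinations."""
    na, nb = nearest_neighbors(layout_a, k), nearest_neighbors(layout_b, k)
    return sum(len(na[g] & nb[g]) for g in layout_a)


def most_central_ordination(layouts, k):
    """Index of the ordination with the largest total shared-neighbor count."""
    totals = [sum(common_neighbor_count(a, b, k)
                  for j, b in enumerate(layouts) if j != i)
              for i, a in enumerate(layouts)]
    best = max(range(len(layouts)), key=totals.__getitem__)
    return best, totals


# Toy ordinations of four genes; the third one scrambles the neighborhoods.
run_a = {"g1": (0, 0), "g2": (1, 0), "g3": (10, 0), "g4": (11, 0)}
run_b = {"g1": (5, 5), "g2": (6, 5), "g3": (15, 5), "g4": (16, 5)}
run_c = {"g1": (0, 0), "g3": (1, 0), "g2": (10, 0), "g4": (11, 0)}
best, totals = most_central_ordination([run_a, run_b, run_c], k=1)
```

The text above uses the 20 nearest neighbors; the toy call uses k = 1. The first two runs agree on every neighborhood (count 4), the scrambled run shares none with either, and one of the two agreeing runs is returned as the most central ordination.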

5. Lists of Informative Genes

Table 51. Discriminating genes that distinguish between ALL and AML types,
derived from Bayesian networks analysis.

A. Bayesian Networks

No. | Affymetrix number | Gene description | Gene symbol | Locus
1 | 38269_at | protein kinase D2 | PKD2 | 19q13.2
2 | 40103_at | villin 2 (ezrin) | VIL2 | 6q25-q26
3 | 41165_g_at | immunoglobulin heavy constant mu | IGHM | 14q32.33
4 | 40310_at | toll-like receptor 2 | TLR2 | 4q32
5 | 38604_at | neuropeptide Y | NPY | 7p15.1
6 | 39689_at | cystatin C | CST3 | 20p11.2
7 | 41356_at | B-cell CLL/lymphoma 11A | BCL11A | 2p15
8 | 461_at | N-acylsphingosine amidohydrolase | ASAH | 8p22-p21.3
9 | 1096_g_at | CD19 antigen | CD19 | 16p11.2
10 | 36938_at | N-acylsphingosine amidohydrolase | ASAH | 8p22-p21.3
11 | 41401_at | cysteine and glycine-rich protein 2 | CSRP2 | 12q21.1
12 | 41523_at | RAB32, member RAS oncogene family | RAB32 | 6q24.2
13 | 40432_at | Homo sapiens, clone IMAGE:4391536 | |
14 | 41164_at | immunoglobulin heavy constant mu | IGHM | 14q32.33
15 | 36766_at | ribonuclease, RNase A family, 2 | RNASE2 | 14q24-q31
16 | 39827_at | hypothetical protein | FLJ20500 | 10pterq26
17 | 37001_at | calpain 2, (m/II) large subunit | CAPN2 | 1q41-q42
18 | 279_at | nuclear receptor subfamily 4 | NR4A1 | 12q13
19 | 39593_at | Similar to fibrinogen-like 2, clone | |
20 | 41038_at | neutrophil cytosolic factor 2 | NCF2 | 1q25
21 | 40936_at | cysteine-rich motor neuron 1 | CRIM1 | 2p21
22 | 32227_at | proteoglycan 1, secretory granule | PRG1 | 10q22.1
23 | 478_g_at | interferon regulatory factor 5 | IRF5 | 7q32
24 | 1230_g_at | cisplatin resistance associated | CRA | 1q12-q21
25 | 35367_at | lectin, galactoside-binding, soluble | LGALS3 | 14q21-q22












Table 52. Discriminating genes that distinguish between ALL and AML types. 




derived horn SVM analysis. 




40 












B. 


SVM 










Affymetrix 


Gene description 


Gene 






Locus 






45 




number 




symbol 




1 


41165_g at 


immunoglobulin heavy constant mu 


IGHM 






14q32.33 








2 


36766 at 


ribonuclease, RNase A family, 2 


RNASE2 


50 




14q24 






3 


38604 at 


neuropeptide Y 


NPY 






7p15.1 








4 


36879 at 


endothelial cell growth factor 1 


ECGF1 






22q13.33 






55 






(platelet-derived) 






5 


41401 at 


cysteine and glycine-rich protein 2 


CSRP2 






12q21.1 








6 


36638 at 


connective tissue growth factor 


CTGF 






6q23.1 







259 



33856_at CAAX box 1 
Xq26 



CXX1 



5 

Table 52. (Continuation) Discriminating genes (between ALL and AML types) 
derived from SVM analysis. 



10 




Affymetrix 
Locus 
number 


Gene description 


Gene 
symbol 




8 


35926 s at 


leukocyte immunoglobulin-like receptor, B 


LILRB1 






19q13.4 






15 


9 


40659 at 


nuclear receptor subfamily 4, group A, member 3 


NR4A3 






9q22 








10 


266 s at 


CD24 antigen (small cell lung carcinoma cluster 4) CD24 






6q21 








11 | 34180_at | Rho guanine nucleotide exchange factor (GEF) 10 | ARHGEF | 8p23
12 | 279_at | nuclear receptor subfamily 4, group A, member 1 | NR4A1 | 12q13
13 | 38661_at | seb4D | HSRNA | 20q13.31
14 | 38363_at | TYRO protein tyrosine kinase binding protein | TYROBP | 19q13.1
15 | 36657_at | apolipoprotein C-II | APOC2 | 19q13.2
16 | 37050_r_at | translocase of outer mitochondrial membrane 34 | TOM34 |
17 | 41523_at | RAB32, member RAS oncogene family | RAB32 | 6q24.2
18 | 39878_at | protocadherin 9 | PCDH9 | 13q14.3
19 | 41577_at | protein phosphatase 1, regulatory (inhibitor) | PPP1R1 | 20q11.23
20 | 854_at | B lymphoid tyrosine kinase | BLK | 8p23-p22
21 | 38403_at | lysosomal-associated membrane protein 2 | LAMP2 | Xq24
22 | 39994_at | chemokine (C-C motif) receptor 1 | CCR1 | 3p21
23 | 33186_i_at | ESTs | |
24 | 32227_at | proteoglycan 1, secretory granule | PRG1 | 10q22.1
25 | 39827_at | hypothetical protein | FLJ20500 | 10pter-q26
26 | 40103_at | villin 2 (ezrin) | VIL2 | 6q25-q26
27 | 34168_at | deoxynucleotidyltransferase, terminal | DNTT | 10q23
28 | 36465_at | interferon regulatory factor 5 | IRF5 | 7q32
29 | 34433_at | docking protein 1 | DOK1 | 2p13
30 | 41239_r_at | cathepsin S | CTSS | 1q21
31 | 40457_at | splicing factor, arginine/serine-rich 3 | SFRS3 | 11
32 | 32827_at | related RAS viral (r-ras) oncogene homolog 2 | RRAS2 | 11pter-p15.5
33 | 33678_i_at | tubulin, beta, 2 | TUBB2 |
34 | 40936_at | cysteine-rich motor neuron 1 | CRIM1 | 2p21
35 | 38242_at | B-cell linker | BLNK | 10q23.2-q23.33
36 | 41164_at | immunoglobulin heavy constant mu | IGHM | 14q32.33
37 | 40220_at | HMBA-inducible | HIS1 | 17q21.32
38 | 40310_at | toll-like receptor 2 | TLR2 | 4q32
39 | 39593_at | Similar to fibrinogen-like 2, IMAGE:4616866 | |
40 | 37844_at | class 1 cytokine receptor | WSX-1 | 19p13.11
41 | 478_g_at | interferon regulatory factor 5 | IRF5 | 7q32
42 | 38138_at | S100 calcium-binding protein A11 (calgizzarin) | S100A11 | 1q21
43 | 40282_s_at | D component of complement (adipsin) | DF | 19p13.3
44 | 36928_at | zinc finger protein 146 | ZNF146 | 19q13.1
45 | 34800_at | ortholog of mouse integral membrane glycoprotein LIG1 | |
46 | 33462_at | G protein-coupled receptor 105 | GPR105 | 3q21-q25
47 | 34950_at | OLF-1/EBF associated zinc finger gene | OAZ | 16q12
48 | 34335_at | ephrin-B2 | EFNB2 | 13q33
49 | 37190_at | WAS protein family, member 1 | WASF1 | 6q21-q22
50 | 40195_at | H2A histone family, member X | H2AFX | 11q23.2-q23.3
51 | 38037_at | diphtheria toxin receptor | DTR | 5q23
52 | 38994_at | STAT induced STAT inhibitor-2 | STATI2 | 12q



Table 52. (Continuation). Discriminating genes (between ALL and AML types) 
derived from SVM analysis. 



No. | Affymetrix number | Gene description | Gene symbol | Locus
53 | 38096_f_at | MHC class II, DP beta 1 | HLA-DPB | 6p21.3
54 | 2063_at | excision repair cross-complementing rodent repair deficiency, complementation group 5 (xeroderma pigmentosum, complementation group G) | ERCC5 | 13q22
55 | 461_at | N-acylsphingosine amidohydrolase | ASAH | 8p22-p21.3
56 | 35449_at | killer cell lectin-like receptor subfamily B - 1 | KLRB1 | 12p13
57 | 41198_at | granulin | GRN | 17q21.32
58 | 38993_r_at | Homo sapiens cDNA: clone HEP03585 | |
59 | 34677_f_at | Homo sapiens mRNA for TL132 | |
60 | 33899_at | aldehyde dehydrogenase 9 family, member A1 | ALDH9A1 | 1q22-q23
61 | 40814_at | iduronate 2-sulfatase (Hunter syndrome) | IDS | Xq28
62 | 33228_g_at | interleukin 10 receptor, beta | IL10RB | 21q22.11
63 | 33458_r_at | H2B histone family, member L | H2BFL | 6p21.3
64 | 41356_at | B-cell CLL/lymphoma 11A (zinc finger protein) | BCL11A | 2p15
65 | 40638_at | splicing factor proline/glutamine rich (polypyrimidine tract-binding protein-associated) | SFPQ | 1p34.2
66 | 40570_at | forkhead box O1A (rhabdomyosarcoma) | FOXO1A | 13q14.1
67 | 40432_at | Homo sapiens, clone IMAGE:4391536, mRNA | |
68 | 39398_s_at | tubulin-specific chaperone d | TBCD | 17q25.3
69 | 2003_s_at | mutS (E. coli) homolog 6 | MSH6 | 2p16
70 | 37561_at | Human DNA sequence from clone 34B21 on chromosome | | 6p12.1
71 | 41038_at | neutrophil cytosolic factor 2 | NCF2 | 1q25
72 | 38402_at | lysosomal-associated membrane protein 2 | LAMP2 | Xq24
73 | 37203_at | carboxylesterase 1 (monocyte/macrophage serine esterase 1) | CES1 | 16q13-q22.1
74 | 34749_at | solute carrier family 31 (copper transporters) | SLC31A2 | 9q31-q32
75 | 40601_at | beta-amyloid binding protein precursor | BBP | 1p31.2
76 | 40194_at | Human chromosome 5q13.1 clone 5G8 mRNA | |
77 | 39566_at | cholinergic receptor, nicotinic, alpha polypeptide 7 | CHRNA7 | 15q14
78 | 32706_at | HIR (histone cell cycle regulation defective) | HIRA | 22q11.21



Table 53. Discriminating genes that distinguish between remission and fail 
overall derived from SVM analysis. 



No. | Affymetrix number | Gene description | Gene symbol | Locus
1 | 41165_g_at | immunoglobulin heavy constant mu | IGHM | 14q32.33
2 | 39389_at | CD9 antigen (p24) | CD9 | 12p13
3 | 41058_g_at | uncharacterized hypothalamus protein HT012 | HT012 | 6p22.2
4 | 31459_i_at | immunoglobulin lambda locus | IGL | 22q11.1-q11.2
5 | 38389_at | 2',5'-oligoadenylate synthetase 1 (40-46 kD) | OAS1 | 12q24.1
6 | 37504_at | E3 ubiquitin ligase SMURF1 | SMURF1 | 7q21.1-q31.1
7 | 40367_at | bone morphogenetic protein 2 | BMP2 | 20p12
8 | 32637_r_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p12.3
9 | 39931_at | dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 3 | DYRK3 | 1q32
10 | 37054_at | bactericidal/permeability-increasing protein | BPI | 20q11
11 | 1404_r_at | small inducible cytokine A5 (RANTES) | SCYA5 | 17q11.2-q12
12 | 1292_at | dual specificity phosphatase 2 | DUSP2 | 2q11
13 | 37709_at | DNA segment, numerous copies | DXF68 | Xp22.32
14 | 36857_at | RAD1 (S. pombe) homolog | RAD1 | 5p13.2
15 | 41196_at | karyopherin (importin) beta 1 | KPNB1 | 17q21
16 | 1182_at | phospholipase C, epsilon | PLCE | 2q33
17 | 34961_at | T cell activation, increased late expression | TACTILE | 3q13.13
18 | 37862_at | dihydrolipoamide branched chain transacylase (E2 component of branched chain keto acid dehydrogenase complex; maple syrup disease) | DBT | 1p31
19 | 38772_at | cysteine-rich, angiogenic inducer, 61 | CYR61 | 1p31-p22
20 | 33208_at | DnaJ (Hsp40) homolog, subfamily C, member 3 | DNAJC3 | 13q32
21 | 37837_at | KIAA0863 protein | KIAA0863 | 18q23
22 | 34031_i_at | cerebral cavernous malformations 1 | CCM1 | 7q21
23 | 38220_at | dihydropyrimidine dehydrogenase | DPYD | 1p22
24 | 34684_at | RecQ protein-like (DNA helicase Q1-like) | RECQL | 12p12
25 | 39449_at | S-phase kinase-associated protein 2 (p45) | SKP2 | 5p13
26 | 32638_s_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p12.3
27 | 35957_at | stannin | SNN | 16p13
28 | 34363_at | selenoprotein P, plasma, 1 | SEPP1 | 5q31
29 | 35431_g_at | RNA polymerase II transcriptional regulation mediator (Med6, S. cerevisiae, homolog of) | MED6 | 14q24.1
30 | 35012_at | myeloid cell nuclear differentiation antigen | MNDA | 1q22
31 | 38432_at | interferon-stimulated protein, 15 kDa | ISG15 | 1p36.33
32 | 35664_at | multimerin | MMRN | 4q22
33 | 41862_at | KIAA0056 protein | KIAA0056 | 11q25
34 | 33210_at | YY1 transcription factor | YY1 | 14q
35 | 35794_at | KIAA0942 protein | KIAA0942 | 8pter
36 | 36108_at | HLA, class II, DQ beta 1 | DQB1 | 6p21.3
37 | 35614_at | transcription factor-like 5 (basic helix-loop-helix) | TCFL5 | 20q13.3
38 | 32089_at | sperm associated antigen 6 | SPAG6 | 10p12







Table 53. (Continuation). Discriminating genes that distinguish between
remissions and fails overall derived from SVM analysis.




No. | Affymetrix number | Gene description | Gene symbol | Locus
39 | 1343_s_at | serine (or cysteine) proteinase inhibitor | SERPINB | 18q21.3
40 | 665_at | serine/threonine kinase 2 | STK2 | 3p21.1
41 | 40901_at | nuclear autoantigen | GS2NA | 14q13
42 | 39299_at | KIAA0971 protein | KIAA0971 | 2q34
43 | 34446_at | KIAA0471 gene product | KIAA0471 | 1q24
44 | 33956_at | MD-2 protein | MD-2 | 8q13.3
45 | 37184_at | syntaxin 1A (brain) | STX1A | 7q11.23
46 | 1773_at | farnesyltransferase, CAAX box, beta | FNTB | 14q23
47 | 34731_at | KIAA0185 protein | KIAA0185 | 10q24.32
48 | 41700_at | coagulation factor II (thrombin) receptor | F2R | 5q13
49 | 38407_r_at | prostaglandin D2 synthase (21 kD, brain) | PTGDS | 9q34.2
50 | 40088_at | nuclear receptor interacting protein 1 | NRIP1 | 21q11.2
51 | 33124_at | vaccinia related kinase 2 | VRK2 | 2p16
52 | 32964_at | egf-like module containing, mucin-like, hormone receptor-like sequence 1 | EMR1 | 19p13.3
53 | 39560_at | chromobox homolog 6 | CBX6 | 22q13.1
54 | 39838_at | CLIP-associating protein 1 | CLASP1 | 2q14.2
55 | 40166_at | CS box-containing WD protein | LOC55884 |
56 | 36927_at | hypothetical protein, expressed in osteoblast | GS3686 | 1p22.3
57 | 41393_at | zinc finger protein 195 | ZNF195 | 11p15.5
58 | 35041_at | neurotrophin 3 | NTF3 | 12p13
59 | 40238_at | G protein-coupled receptor, family C, group 5 | GPRC5B | 16p12
60 | 39926_at | MAD (mothers against decapentaplegic, Drosoph) | MADH5 | 5q31
61 | 36674_at | small inducible cytokine A4 | SCYA4 | 17q21
62 | 32132_at | KIAA0675 gene product | KIAA0675 | 3q13.13
63 | 38252_s_at | 1,6-glucosidase, 4-alpha-glucanotransferase | AGL | 1p21
64 | 33598_r_at | cold autoinflammatory syndrome 1 | CIAS1 | 1q44
65 | 37409_at | SFRS protein kinase 2 | SRPK2 | 7q22
66 | 41019_at | phosducin-like | PDCL | 9q12
67 | 1113_at | bone morphogenetic protein 2 | BMP2 | 20p12
68 | 37208_at | phosphoserine phosphatase-like | PSPHL | 7q11.2
69 | 32822_at | solute carrier family 25 | SLC25A4 | 4q35
70 | 32249_at | H factor (complement)-like 1 | HFL1 | 1q32
71 | 39600_at | EST | |
72 | 32648_at | delta-like homolog (Drosophila) | DLK1 | 14q32
73 | 39269_at | replication factor C (activator 1) 3 (38kD) | RFC3 | 13q12.3
74 | 37724_at | v-myc avian myelocytomatosis viral oncogene | MYC | 8q24.12
75 | 35606_at | histidine decarboxylase | HDC | 15q21
76 | 31926_at | cytochrome P450, subfamily VIIA | CYP7A1 | 8q11
77 | 32142_at | serine/threonine kinase 3 (Ste20, yeast homolog) | STK3 | 8p22
78 | 32789_at | nuclear cap binding protein subunit 2, 20kD | NCBP2 | 3q29
79 | 37279_at | GTP-binding protein (skeletal muscle) | GEM | 8q13
80 | 40246_at | discs, large (Drosophila) homolog 1 | DLG1 | 3q29
81 | 37547_at | PTH-responsive osteosarcoma B1 protein | B1 | 7p14
82 | 32298_at | a disintegrin and metalloproteinase domain 2 | ADAM2 | 8p11.2
83 | 40496_at | complement component 1, s subcomponent | C1S | 12p13
84 | 39032_at | transforming growth factor beta-stimulated protein | TSC22 | 13q14












Table 54. Discriminating genes that distinguish between remission and fail,
inside the ALL type, derived from SVM.




No. | Affymetrix number | Gene description | Gene symbol | Locus
1 | 39389_at | CD9 antigen (p24) | CD9 | 12p13
2 | 1292_at | dual specificity phosphatase 2 | DUSP2 | 2q11
3 | 31459_i_at | immunoglobulin lambda locus | IGL | 22q11.1
4 | 36674_at | small inducible cytokine A4 | SCYA4 | 17q21
5 | 32637_r_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p12.3
6 | 35756_at | chromosome 19 open reading frame 3 | C19orf3 | 19p13.1
7 | 41700_at | coagulation factor II (thrombin) receptor | F2R | 5q13
8 | 31853_at | embryonic ectoderm development | EED | 11q14.2
9 | 31329_at | putative opioid receptor, neuromedin K (neurokinin B) receptor-like | TAC3RL |
10 | 34491_at | 2'-5'-oligoadenylate synthetase-like | OASL | 12q24.2
11 | 34961_at | T cell activation, increased late expression | TACTILE | 3q13.13
12 | 160021_r_at | progesterone receptor | PGR | 11q22
13 | 37773_at | KIAA1005 protein | KIAA1005 | 16
14 | 38367_s_at | complement component 4-binding protein, beta | C4BPB | 1q32
15 | 32279_at | glutamate decarboxylase 2 | GAD2 | 10p11
16 | 36108_at | MHC complex, class II, DQ beta 1 | DQB1 | 6p21.3
17 | 34378_at | adipose differentiation-related protein | ADFP | 9p21.3
18 | 777_at | GDP dissociation inhibitor 2 | GDI2 | 10p15
19 | 35140_at | cyclin-dependent kinase 8 | CDK8 | 13q12
20 | 33208_at | DnaJ (Hsp40) homolog, subfamily C, member 3 | DNAJC3 | 13q32
21 | 33405_at | adenylyl cyclase-associated protein 2 | CAP2 | 6p22.3
22 | 39580_at | KIAA0649 gene product | KIAA0649 | 9q34.3
23 | 32469_at | carcinoembryonic antigen- cell adhesion 3 | CEACAM | 19q13.2
24 | 38539_at | solute carrier family 24, member 1 | SLC24A1 | 15q22
25 | 1454_at | MAD (mothers against decapentaplegic) 3 | MADH3 | 15q21
26 | 35289_at | rab6 GTPase activating protein | GPCENA | 9q34.11
27 | 37724_at | v-myc avian myelocytomatosis viral oncogene | MYC | 8q24.12-q24.13
28 | 32521_at | secreted frizzled-related protein 1 | SFRP1 | 8p12
29 | 1375_s_at | tissue inhibitor of metalloproteinase 2 | TIMP2 | 17q25
30 | 555_at | GTP-binding protein homologous | SEC4 | 17q25.3
31 | 224_at | TGFB inducible early growth response | TIEG | 8q22.2
32 | 40367_at | bone morphogenetic protein 2 | BMP2 | 20p12
33 | 41504_s_at | v-maf aponeurotic fibrosarcoma oncogene | MAF | 16q22
34 | 40166_at | CS box-containing WD protein | LOC55884 |
35 | 35228_at | carnitine palmitoyltransferase I, muscle | CPT1B | 22q13
36 | 33491_at | sucrase-isomaltase | SI | 3q25.2
37 | 1182_at | phospholipase C, epsilon | PLCE | 2q33
38 | 38869_at | KIAA1069 protein | KIAA1069 | 3q25.31
39 | 35811_at | ring finger protein 13 | RNF13 | 3q25.1
40 | 37504_at | E3 ubiquitin ligase SMURF1 | SMURF1 | 7q21.1-q31.1
41 | 160025_at | transforming growth factor, alpha | TGFA | 2p13
42 | 35233_r_at | centrin, EF-hand protein, 3 (CDC31 yeast) | CETN3 | 5q14.3
43 | 40399_r_at | mesenchyme homeo box 2 (growth arrest) | MEOX2 | 7p22.1-p21.3

Table 54. (Continuation). Discriminating genes that distinguish between 
remission and fail, inside the ALL type, derived from SVM. 



No. | Affymetrix number | Gene description | Gene symbol | Locus
44 | 31810_g_at | contactin 1 | CNTN1 | 12q11
45 | 40789_at | adenylate kinase 2 | AK2 | 1p34
46 | 35614_at | transcription factor-like 5 (basic helix-loop-helix) | TCFL5 | 20q13.3
47 | 34482_at | hypothetical protein MGC4701 | MGC4701 | 4p16.3
48 | 34252_at | hypothetical protein FLJ10342 | FLJ10342 | 6q16.1
49 | 32638_s_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p12.3
50 | 39440_f_at | mRNA (from clone DKFZp566HG124) | |
51 | 1467_at | epidermal growth factor receptor substrate | EPS8 | 12q23
52 | 37500_at | zinc finger protein 175 | ZNF175 | 19q13.4
53 | 1307_at | xeroderma pigmentosum, complement group A | XPA | 9q22.3
54 | 1530_g_at | ESP | |
55 | 37641_at | ESP | |
56 | 36849_at | PTPL1-associated RhoGAP 1 | PARG1 | 1
57 | 38797_at | KIAA0062 protein | KIAA0062 | 8p21.2
58 | 40510_at | heparan sulfate 2-O-sulfotransferase | HS2ST1 | 1p31.1
59 | 34168_at | deoxynucleotidyltransferase, terminal | DNTT | 10q23-q24
60 | 36682_at | pericentriolar material 1 | PCM1 | 8p22-p21.3
61 | 34335_at | ephrin-B2 | EFNB2 | 13q33
62 | 41028_at | ryanodine receptor 3 | RYR3 | 15q14-q15
63 | 31434_at | Homo sapiens aconitase precursor (ACON) mRNA, nuclear gene encoding mitochondrial, partial cds | |
64 | 35293_at | Sjogren syndrome antigen A2 | SSA2 | 1q31
65 | 32987_at | FSH primary response (LRPR1, rat) homolog 1 | FSHPRH1 | Xq22
66 | 34731_at | KIAA0185 protein | KIAA0185 | 10q24
67 | 35102_at | zinc finger protein | ZFP | 3p22.3
68 | 35664_at | multimerin | MMRN | 4q22
69 | 32461_f_at | zinc finger protein 81 (HFZ20) | ZNF81 | Xp22.1
70 | | | | 14q32
71 | 37282_at | MAD2 (mitotic arrest deficient, yeast)-like 1 | MAD2L1 | 4q27
72 | 38407_r_at | prostaglandin D2 synthase (21 kD, brain) | PTGDS | 9q34.2-q34.3
73 | 873_at | homeo box A5 | HOXA5 | 7p15-p14
74 | 36539_at | Homo sapiens cDNA FLJ32313 fis, clone PROST2003232, weakly similar to BETA-GLUCURONIDASE PRECURSOR (EC 3.2.1.31) | |
75 | 37602_at | guanidinoacetate N-methyltransferase | GAMT | 19p13.3
76 | 38821_at | progesterone receptor membrane component 2 | PGRMC2 | 4q26
77 | 36248_at | NAG-5 protein | NAG5 | 9p12
78 | 33796_at | ADP-ribosylation factor-like 4 | ARL4 | 7p21
79 | 37760_at | BAI1-associated protein 2 | BAIAP2 | 17q25
80 | 35299_at | MAP kinase-interacting serine/threonine kinase 1 | MKNK1 | 1p33



Table 55. Discriminating genes that distinguish between remission and fail,
inside the AML type, derived from SVM analysis.





No. | Affymetrix number | Gene description | Gene symbol | Locus
1 | 32789_at | nuclear cap binding protein subunit 2, 20kD | NCBP2 | 3q29
2 | 39175_at | phosphofructokinase, platelet | PFKP | 10p15.3
3 | 41058_g_at | uncharacterized hypothalamus protein HT012 | HT012 | 6p22.2
4 | 38299_at | interleukin 6 (interferon, beta 2) | IL6 | 7p21
5 | 41475_at | ninjurin 1 | NINJ1 | 9q22
6 | 38389_at | 2',5'-oligoadenylate synthetase 1 (40-46 kD) | OAS1 | 12q24.1
7 | 35803_at | ras homolog gene family, member E | ARHE | 2q23.3
8 | 36419_at | phospholipase C, beta 3 | PLCB3 | 11q13
9 | 32067_at | cAMP responsive element modulator | CREM | 10p12.1
10 | 39924_at | KIAA0853 protein | KIAA0853 | 13q14
11 | 39246_at | stromal antigen 1 | STAG1 | 3q22.3
12 | 38252_s_at | glycogen debranching enzyme (disease type III) | AGL | 1p21
13 | 35127_at | H2A histone family, member A | H2AFA | 6p22.2
14 | 35486_at | Vertebrate LIN7, Tax interaction protein 33 | VELI1 | 12q21
15 | 1368_at | interleukin 1 receptor, type 1 | IL1R1 | 2q12
16 | 40635_at | flotillin 1 | FLOT1 | 6p21.3
17 | 1679_at | postmeiotic segregation increased 2-like 6 | PMS2L6 | 7q11
18 | 37354_at | nuclear antigen Sp100 | SP100 | 2q37.1
19 | 1065_at | fms-related tyrosine kinase 3 | FLT3 | 13q12
20 | 41470_at | prominin (mouse)-like 1 | PROML1 | 4p15.33
21 | 37483_at | histone deacetylase 9 | HDAC9 | 7p21-p15
22 | 34363_at | selenoprotein P, plasma, 1 | SEPP1 | 5q31
23 | 34631_at | eyes absent (Drosophila) homolog 4 | EYA4 | 6q23
24 | 33124_at | vaccinia related kinase 2 | VRK2 | 2p16
25 | 39931_at | dual-specificity tyrosine-(Y)- kinase 3 | DYRK3 | 1q32
26 | 37185_at | serine (or cysteine) proteinase inhibitor | SERPINB | 18q21.3
27 | 717_at | GS3955 protein | GS3955 | 2p25.1
28 | 4UoUd r at | phosphatidyl inositol glycan, class K | PIGK | 1p31.1
29 | 32636_f_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p12.3
30 | 38052_at | coagulation factor XIII, A1 polypeptide | F13A1 | 6p25.3-p24.3
31 | 772_at | v-crk avian sarcoma virus oncogene homolog | CRK | 17p13.3
32 | 41362_at | ATP-binding cassette, sub-family G (WHITE) | ABCG1 | 21q22.3
33 | 36849_at | PTPL1-associated RhoGAP 1 | PARG1 |
34 | 1451_s_at | osteoblast specific factor 2 (fasciclin I-like) | OSF-2 | 13q13.2
35 | 37547_at | PTH-responsive osteosarcoma B1 protein | B1 | 7p14
36 | 37504_at | E3 ubiquitin ligase SMURF1 | SMURF1 | 7q21.1
37 | 33881_at | fatty-acid-Coenzyme A ligase, long-chain 3 | FACL3 | 2q34
38 | 40439_at | arsA (bacterial) arsenite transporter, ATP-binding | ASNA1 | 19q13.3
39 | 1914_at | cyclin A1 | CCNA1 | 13q12.3
40 | 40928_at | DKFZP564A122 protein | DKFZP | 17q11.2
41 | 36014_at | hypothetical protein DKFZp564D0462 | DKFZP | 6q23.1
42 | 34355_at | methyl CpG binding protein 2 (Rett syndrome) | MECP2 | Xq28
43 | 38096_f_at | MHC, class II, DP beta 1 | DPB1 | 6p21.3
44 | 32298_at | a disintegrin and metalloproteinase domain 2 | ADAM2 | 8p11.2
45 | 35699_at | budding uninhibited by benzimidazoles 1 | BUB1B | 15q15
46 | 41165_g_at | immunoglobulin heavy constant mu | IGHM | 14q32







Table 55. (Continuation). Discriminating genes that distinguish between 
remission and fail, inside the AML type, derived from SVM analysis. 



No. | Affymetrix number | Gene description | Gene symbol | Locus
47 | 35422_at | microtubule-associated protein 2 | MAP2 | 2q34
48 | 41471_at | S100 calcium-binding protein A9 (calgranulin B) | S100A9 | 1q21
49 | 34761_r_at | a disintegrin and metalloproteinase domain 9 | ADAM9 |
50 | 31786_at | Sam68-like phosphotyrosine protein, T-STAR | T-STAR | 8q24.2
51 | 40318_at | dynein, cytoplasmic, intermediate polypeptide 1 | DNCI1 | 7q21.3
52 | 40497_at | homologous to yeast nitrogen permease | NPR2L | 3p21.3
53 | 34728_g_at | S-adenosylhomocysteine hydrolase-like 1 | AHCYL1 | 1
54 | 36857_at | RAD1 (S. pombe) homolog | RAD1 | 5p13.2
55 | 39449_at | bleomycin hydrolase | BLMH | 17q11.2
56 | 40498_g_at | homologous to yeast nitrogen permease | NPR2L | 3p21.3
57 | 37936_at | PRP4/STK/WD splicing factor | HPRP4P | 9q31
58 | 34891_at | dynein, cytoplasmic, light polypeptide | PIN | 14q24
59 | 39061_at | bone marrow stromal cell antigen 2 | BST2 | 19p13.2
60 | 34446_at | KIAA0471 gene product | KIAA0471 | 1q24
61 | 37456_at | serum constituent protein | MSE55 | 22q13.1
62 | 41385_at | erythrocyte membrane protein band 4.1-like 3 | EPB41L3 | 18p11
63 | 990_at | fms-related tyrosine kinase 1 (vascular endothelial growth factor/vascular permeability factor receptor) | FLT1 | 13q12
64 | 37203_at | carboxylesterase 1 | CES1 | 16q13
65 | 40071_at | cytochrome P450, subfamily 1 | CYP1B1 | 2p21
66 | 1491_at | pentaxin-related gene, induced by IL-1 beta | PTX3 | 3q25
67 | 31558_at | Hr44 antigen | HR44 |
68 | 761_g_at | dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 2 | DYRK2 | 12q14.3
69 | 32607_at | brain abundant, membrane signal protein 1 | BASP1 | 5p15.1
70 | 32305_at | collagen, type 1, alpha 2 | COL1A2 | 7q22.1
71 | 531_at | glioma pathogenesis-related protein | RTVP1 | 12q15
72 | 40901_at | nuclear autoantigen | GS2NA | 14q13
73 | 35609_at | protocadherin gamma subfamily A, 8 | PCDHGA8 | 5q31
74 | 40851_r_at | Sec23 (S. cerevisiae) homolog B | SEC23B | 20p11
75 | 41022_r_at | glycerol-3-phosphate dehydrogenase 2 | GPD2 | 2q24.1
76 | 40853_at | ATPase, Class V, type 10D | ATP10D | 4p12
77 | 38555_at | dual specificity phosphatase 10 | DUSP10 | 1q41
78 | 41393_at | zinc finger protein 195 | ZNF195 | 11p15.5
79 | 32089_at | sperm associated antigen 6 | SPAG6 | 10p12
80 | 32072_at | mesothelin | MSLN | 16p13.3
81 | 394_at | S-phase kinase-associated protein 2 (p45) | SKP2 | 5p13
82 | 32605_r_at | RAB1, member RAS oncogene family | RAB1 | 2p14
83 | 31665_s_at | CDA02 protein | CDA02 | 3q24
84 | 35940_at | POU domain, class 4, transcription factor 1 | POU4F1 | 13q21.1
85 | 37469_at | Rough Deal (Drosophila) homolog | KIAA0166 | 12q24
86 | 32599_at | tuberous sclerosis 1 | TSC1 | 9q34
87 | 33894_at | neuroepithelial cell transforming gene 1 | NET1 | 10p15



Table 56. Discriminating genes that distinguish between remission and fail, 
inside the Vxinsight cluster A, derived from Bayesian Networks and SVM 
analysis. 

A. Bayesian Networks 



No. | Affymetrix number | Gene description | Gene symbol | Locus
1 | 1247_g_at | protein tyrosine phosphatase, receptor type, S | PTPRS | 19p13.3
2 | 128_at | cathepsin K (pycnodysostosis) | CTSK | 1q21
3 | 1445_at | chemokine (C-C motif) receptor-like 2 | CCRL2 | 3p21
4 | 1509_at | matrix metalloproteinase 16 (membrane-inserted) | MMP16 | 8q21
5 | 1523_g_at | tyrosine kinase, non-receptor, 1 | TNK1 | 17p13.1
6 | 1578 J at | androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease) | AR | Xq11.2-q12
7 | 158_at | DnaJ (Hsp40) homolog, subfamily B, member 4 | DNAJB4 | 1p22.3
8 | 1777_at | ras inhibitor | RIN1 | 11q13.1
9 | 31375_at | ADP-ribosylation factor-like 3 | ARL3 | 10q23.3
10 | 31440_at | transcription factor 7 (T-cell specific, HMG-box) | TCF7 | 5q31.1
11 | 31552_at | Homo sapiens low density lipoprotein receptor | |
12 | 31713_s_at | large (Drosophila) homolog-associated protein 2 | DLGAP2 | 8p23
13 | 31996_at | brefeldin A-inhibited guanine nucleotide-exchange | 2BIG2 | 20q13
14 | 32029_at | 3-phosphoinositide dependent protein kinase-1 | PDPK1 | 16p13.3
15 | 32823_at | vacuolar protein sorting 11 (yeast homolog) | VPS11 | 11q23
16 | 32903_at | transforming growth factor, beta receptor 1 | TGFBR1 | 9q22
17 | 33019_at | Parkinson disease (autosomal recessive, juvenile) | PARK2 | 6q25.2
18 | 33280_r_at | SA (rat hypertension-associated) homolog | SAH | 16p13.11
19 | 34110 J at | proline oxidase homolog | PIG6 |
20 | 34124_at | similar to prokaryotic-type class 1 peptide chain release factors | LOC16 | 6q25
21 | 34181_at | aspartylglucosaminidase | AGA | 4q32
22 | 35044_i_at | bone morphogenetic protein 8 (osteogenic 2) | BMP8 | 1p35
23 | 35375_at | apurinic/apyrimidinic endonuclease (nuclease) | APEXL2 | Xp11.23
24 | 35942_at | GA-binding protein transcription factor, beta 1 | GABPB1 | 7q11.2



Table 56. (Continuation). Discriminating genes that distinguish between
remission and fail, inside the Vxinsight cluster A, derived from SVM analysis.



B. SVM 



No. | Affymetrix number | Gene description | Gene symbol | Locus
1 | 39389_at | CD9 antigen (p24) | CD9 | 12p13.3
2 | 1292_at | dual specificity phosphatase 2 | DUSP2 | 2q11
3 | 36674_at | small inducible cytokine A4 | SCYA4 | 17q12
4 | 32637_r_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p13.2
5 | 35756_at | regulator of G-protein signalling 19 interacting | RGS19IP1 | 19p13.1
6 | 41700_at | coagulation factor II (thrombin) receptor | F2R | 5q13
7 | 31853_at | embryonic ectoderm development | EED | 11q14
8 | 31329_at | Human putative opioid receptor mRNA, complete | |
9 | 34491_at | 2'-5'-oligoadenylate synthetase-like | OASL | 12q24.2
10 | 34961_at | T cell activation, increased late expression | TACTILE | 3q13.2
11 | 160021_r_at | progesterone receptor | PGR | 11q22-q23
12 | 38367_s_at | complement component 4 binding protein, beta | C4BPB | 1q32
13 | 32279_at | glutamate decarboxylase 2 (pancreas and brain) | GAD2 | 10p11.23
14 | 36108_at | MHC, class II, DQ beta 1 | DQB1 | 6p21.3
15 | 34378_at | adipose differentiation-related protein | ADFP | 9p21.2
16 | 777_at | GDP dissociation inhibitor 2 | GDI2 | 10p15
17 | 35140_at | cyclin-dependent kinase 8 | CDK8 | 13q12
18 | 33208_at | DnaJ (Hsp40) homolog, subfamily C, member 3 | DNAJC3 | 13q32
19 | 33405_at | adenylyl cyclase-associated protein 2 | CAP2 | 6p22.2
20 | 39580_at | KIAA0649 gene product | KIAA0649 | 9q34.3
21 | 32469_at | carcinoembryonic antigen-related cell adhesion | CEACAM | 19q13.2
22 | 38539_at | solute carrier family 24 | SLC24A1 | 15q22
23 | 33739_at | Homo sapiens mRNA full length insert cDNA | |
24 | 1454_at | MAD, mothers against decapentaplegic 3 | MADH3 | 15q21-q22
25 | 35289_at | rab6 GTPase activating protein | CENA | 9q34.11
26 | 37724_at | v-myc myelocytomatosis viral oncogene homolog | MYC | 8q24.12
27 | 32521_at | secreted frizzled-related protein 1 | SFRP1 | 8p12-p11.1
28 | 1375_s_at | tissue inhibitor of metalloproteinase 2 | TIMP2 | 17q25
29 | 615_s_at | parathyroid hormone-like hormone | PTHLH | 12p12.1
30 | 555_at | RAB40B, member RAS oncogene family | RAB40B | 17q25.3
31 | 224_at | TGFB inducible early growth response | TIEG | 8q22.2
32 | 40367_at | bone morphogenetic protein 2 | BMP2 | 20p12
33 | 37380_at | general transcription factor IIB | GTF2B | 1p22-p21
34 | 41504_s_at | v-maf aponeurotic fibrosarcoma oncogene | MAF | 16q22-q23
35 | 40166_at | CS box-containing WD protein | LOC55 |
36 | 35228_at | carnitine palmitoyltransferase 1, muscle | CPT1B | 22q13.33
37 | 36113_s_at | troponin T1, skeletal, slow | TNNT1 | 19q13.4
38 | 33491_at | sucrase-isomaltase | SI | 3q25.2
39 | 1182_at | phospholipase C-like 1 | PLCL1 | 2q33
40 | 38869_at | KIAA1069 protein | KIAA1069 | 3q26.1
41 | 35811_at | ring finger protein 13 | RNF13 | 3q25.1
42 | 33186_i_at | ESTs | |
43 | 37504_at | E3 ubiquitin ligase SMURF1 | SMURF1 | 7q21.1
44 | 160025_at | transforming growth factor, alpha | TGFA | 2p13







Table 56. (Continuation). Discriminating genes that distinguish between
remission and fail, inside the Vxinsight cluster A, derived from SVM analysis.

No. | Affymetrix number | Gene description | Gene symbol | Locus
45 | 32684_at | Homo sapiens clone 23579 mRNA sequence | |
46 | 35233_r_at | centrin, EF-hand protein, 3 (CDC31 homolog) | CETN3 | 5q14.3
47 | 40399_r_at | mesenchyme homeo box 2 (growth arrest) | MEOX2 | 7p22.1
48 | 36777_at | DNA segment on chromosome 12 (unique) 2489 | D12S | 12p13.2
49 | 31810_g_at | contactin 1 | CNTN1 | 12q11-q12
50 | 33747_s_at | RNA, U17D small nucleolar | RNU17D | 1p36.1
51 | 37577_at | hypothetical protein MGC14258 | MGC | 10q24.2
52 | 40789_at | adenylate kinase 2 | AK2 | 1p34
53 | 34855_at | hypothetical protein MGC5378 | MGC5378 | 14q32.31
54 | 35614_at | transcription factor-like 5 (basic helix-loop-helix) | TCFL5 | 20q13.3
55 | 34482_at | hypothetical protein MGC4701 | MGC4701 | 4p16.3
56 | 37220_at | Fc fragment of IgG, receptor for - CD64 | FCGR1A | 1q21.2
57 | 36444_s_at | small inducible cytokine subfamily A | SCYA23 | 17q21.1
58 | 34252_at | hypothetical protein FLJ10342 | FLJ10342 | 6q16.1
59 | 32638_s_at | PI-3-kinase-related kinase SMG-1 | SMG1 | 16p13.2
60 | 1467_at | epidermal growth factor receptor 8 | EPS8 | 12q23-q24
61 | 37500_at | zinc finger protein 175 | ZNF175 | 19q13.4
62 | 1307_at | xeroderma pigmentosum, complement group A | XPA | 9q22.3
63 | 1530_g_at | hypothetical protein CG003 | 13CDNA | 13q12.3
64 | 37641_at | interferon-induced protein 44 | IFI44 | 1p31.1
65 | 36849_at | PTPL1-associated RhoGAP 1 | PARG1 | 1p22.1
66 | 38797_at | KIAA0062 protein | KIAA0062 | 8p21.2
67 | 40510_at | heparan sulfate 2-O-sulfotransferase 1 | HS2ST1 | 1p31.1
68 | 34168_at | deoxynucleotidyltransferase, terminal | DNTT | 10q23-q24
69 | 36682_at | pericentriolar material 1 | PCM1 | 8p22-p21.3
70 | 34335_at | ephrin-B2 | EFNB2 | 13q33
71 | 40549_at | cyclin-dependent kinase 5 | CDK5 | 7q36
72 | 41028_at | ryanodine receptor 3 | RYR3 | 15q14-q15
73 | 31434_at | Homo sapiens aconitase precursor (ACON) | |
74 | 33031_at | Homo sapiens mRNA full length insert cDNA clone | |
75 | 35293_at | Sjogren syndrome antigen A2 (60kD) | SSA2 | 1q31
76 | 32987_at | FSH primary response (LRPR1 homolog, rat) 1 | FSHPRH1 | Xq22
77 | 34731_at | KIAA0185 protein | KIAA0185 | 10q25.1
78 | 35102_at | zinc finger protein | ZFP | 3p22.3



276 



10 



15 



20 



79 


35664 at 


multimerin 


MMRN 




4q22 






80 


34208 at 


solute carrier family 12, memt)er 5 


SLC12A5 




20q13.12 






81 


37864 s at 


immunoglobulin heavy constant gamma 3 


IGHG3 




14q32.33 






82 


37282 at 


MAD2 mitotic arrest deficient-like 1 (yeast) 


MAD2L1 




4q27 






83 


38407 r at 


prostaglandin D2 synthase (21 kD, brain) 


PTGDS 




9q34.2 






84 


37602 at 


guanidinoacetate N-methyltransferase 


GAMT 




19p13.3 






85 


38821 at 


progesterone receptor membrane component 2 


PGRMC2 




4q26 






86 


36248 at 


NAG-5 protein 


NAGS 




9p11.2 






87 


33796 at 


epithelial protein lost in neoplasm beta 


EPLIN 




12q13 






88 


37760 at 


BAM -associated protein 2 


BAIAP2 




17q25 






89 


35299 at 


MAP kinase-interacting serine/threonine kinase 1 


MKNK1 




1p34.1 







Table 57. Discriminating genes that distinguish between remission and fail,
inside the Vxinsight cluster C, derived from Bayesian Networks and SVM
analysis.

A. Bayesian Networks

    Affymetrix   Gene description                                      Gene      Locus
    number                                                             symbol

1   111_at       Rab geranylgeranyltransferase, alpha subunit          RAB       14q11.2
3   1274_s_at    cell division cycle 34                                CDC34     19p13.3
4   1561_at      dual specificity phosphatase 8                        DUSP8     11p15.5
6   31405_at     melatonin receptor 1B                                 MTNR1B    11q21-q22
7   31803_at     KIAA0653 protein, B7-like protein                     KIAA0653  21q22.3
8   32334_f_at   ubiquitin C                                           UBC       12q24.3
9   32892_at     ribosomal protein S6 kinase, 90kD                     RPS6KA2   6q27
10  33095_i_at   beaded filament structural protein 2, phakinin        BFSP2     3q21-q25
11  33293_at     lifeguard                                             KIAA0950  12q13
12  34913_at     calcium channel, voltage-dependent, L type            CACNA1S   1q32
13  35957_at     stannin                                               SNN       16p13
14  36038_r_at   spectrin, beta, erythrocytic                          SPTB      14q23
15  36342_r_at   H factor (complement)-like 3                          HFL3      1q31-q32.1
16  37596_at     phospholipase C, delta 1                              PLCD1     3p22-p21.3
17  38299_at     interleukin 6 (interferon, beta 2)                    IL6       7p21
18  41520_at     hypothetical protein LOC56148
19  772_at       v-crk avian sarcoma virus CT10 oncogene               CRK       17p13.3
20  1001_at      tyrosine kinase with immunoglobulin and               TIE       1p34-p33
                 epidermal growth factor homology domains
21  1707_g_at    v-raf murine sarcoma viral oncogene homolog           ARAF1     Xp11.4-p11.2
22  1719_at      mutS (E. coli) homolog 3                              MSH3      5q11-q12
23  1962_at      arginase, liver                                       ARG1      6q23
24  2034_s_at    cyclin-dependent kinase inhibitor 1B                  CDKN1B    12p13.1
25  31505_at     ribosomal protein L8                                  RPL8      8q24.3



Table 57. (Continuation). Discriminating genes that distinguish between
remission and fail, inside the Vxinsight cluster C, derived from SVM analysis.

B. SVM

    Affymetrix   Gene description                                      Gene      Locus
    number                                                             symbol

1   914_g_at     v-ets erythroblastosis virus E26 oncogene like        ERG       21q22.3
2   32789_at     nuclear cap binding protein subunit 2, 20kD           NCBP2     3q29
3   38299_at     interleukin 6 (interferon, beta 2)                    IL6       7p21
4   39175_at     phosphofructokinase, platelet                         PFKP      10p15.3
5   1368_at      interleukin 1 receptor, type 1                        IL1R1     2q12
6   41219_at     Homo sapiens mRNA; cDNA DKFZp586J101
7   38389_at     2',5'-oligoadenylate synthetase 1 (40-46 kD)          OAS1      12q24.1
8   32067_at     cAMP responsive element modulator                     CREM      10p12.1
9   41058_g_at   uncharacterized hypothalamus protein HT012            HT012     6p21.32
10  41425_at     Friend leukemia virus integration 1                   FLI1      11q24.1
11  33124_at     vaccinia related kinase 2                             VRK2      2p16-p15
12  41475_at     ninjurin 1                                            NINJ1     9q22
13  38866_at     EST
14  35803_at     ras homolog gene family, member E                     ARHE      2q23.3
15  41096_at     S100 calcium binding protein A8 (calgranulin A)       S100A8    1q21
16  33800_at     adenylate cyclase 9                                   ADCY9     16p13.3
17  37143_s_at   phosphoribosylformylglycinamidine synthase            PFAS      17p13
18  37535_at     cAMP responsive element binding protein 1             CREB1     2q32.3-q34
19  38253_at     amylo-1,6-glucosidase, 4-alpha-                       AGL       1p21
20  36857_at     RAD1 homolog (S. pombe)                               RAD1      5p13.2
21  39931_at     dual-specificity tyrosine-(Y)-phosphorylation         DYRK3     1q32
                 regulated kinase 3
22  772_at       v-crk sarcoma virus CT10 oncogene homolog             CRK       17p13.3
23  35957_at     stannin                                               SNN       16p13
24  41755_at     KIAA0977 protein                                      KIAA0977  2q24.3
25  31786_at     RNA binding, signal transduction associated 3         KHDRBS3   8q24.2
26  35127_at     H2A histone family, member A                          H2AFA     6p22.
27  40928_at     SOCS box-containing WD protein SWiP-1                 WSB1      17q11.1
28  32636_f_at   PI-3-kinase-related kinase SMG-1                      SMG1      16p13.2
29  531_at       glioma pathogenesis-related protein                   RTVP1     12q14.1
30  35860_r_at   ESTs
31  41471_at     S100 calcium binding protein A9 (calgranulin B)       S100A9    1q21
32  35582_at     ESTs
33  39878_at     protocadherin 9                                       PCDH9     13q14.3
34  37504_at     E3 ubiquitin ligase SMURF1                            SMURF1    7q21.1
33  34965_at     cystatin F (leukocystatin)                            CST7      20p11.21
34  37050_r_at   translocase of outer mitochondrial membrane 34        TOMM34
35  32034_at     zinc finger protein 217                               ZNF217    20q13.2
36  33104_at     PH domain containing protein in retina 1              PHRET1    11q13.5
37  40318_at     dynein, cytoplasmic, intermediate polypeptide 1       DNCI1     7q21.3
38  34387_at     KIAA0205 gene product                                 KIAA0205  1p36.13
39  37208_at     phosphoserine phosphatase-like                        PSPHL     7q11.2
40  38139_at     fucose-1-phosphate guanylyltransferase                FPGT      1p31.1
41  1914_at      cyclin A1                                             CCNA1     13q12.3
42  IMjaX        GS3955 protein                                        GS3955    2p25.1
43  36123_at     thiosulfate sulfurtransferase (rhodanese)             TST       22q13.1
44  33881_at     fatty-acid-Coenzyme A ligase, long-chain 3            FACL3     2q34-q35
45  35606_at     histidine decarboxylase                               HDC       15q21-q22
46  36478_at     transcription termination factor, RNA polymerase I    TTF1      9q34.3
47  34363_at     selenoprotein P, plasma, 1                            SEPP1     5q31
48  34631_at     eyes absent homolog 4 (Drosophila)                    EYA4      6q23
49  37773_at     KIAA1005 protein                                      KIAA1005  16q12.2
50  1451_s_at    osteoblast specific factor 2 (fasciclin I-like)       OSF-2     13q13.2
51  40635_at     flotillin 1                                           FLOT1     6p21.3
52  34961_at     T cell activation, increased late expression          TACTILE   3q13.2
53  32637_r_at   PI-3-kinase-related kinase SMG-1                      SMG1      16p13.2
54  1808_s_at    tumor necrosis factor receptor superfamily, 6         TNFRSF6   10q24.1
55  1369_s_at    interleukin 8                                         IL8       4q13-q21
56  35614_at     transcription factor-like 5 (basic helix-loop-helix)  TCFL5     20q13.3
57  40511_at     GATA binding protein 3                                GATA3     10p15
58  1229_at      cisplatin resistance associated                       CRA       1q12-q21
59  34247_at     protease, serine, 12 (neurotrypsin, motopsin)         PRSS12    4q25-q26
60  35980_at     phospholipase C, beta 1                               PLCB1     20p12
61  33715_r_at   general transcription factor IIH, polypeptide 2       GTF2H2    5q12.2
62  852_at       integrin, beta 3                                      ITGB3     17q21.32
63  1913_at      cyclin G2                                             CCNG2     4q13.3
64  36569_at     tetranectin (plasminogen binding protein)             TNA       3p22-p21.3
65  41708_at     KIAA1034 protein                                      KIAA1034  2q33
66  41348_at     iroquois homeobox protein 5                           IRX5      16q11.2
67  38952_s_at   collagen, type XIII, alpha 1                          COL13A1   10q22
68  33553_r_at   chemokine (C-C motif) receptor 6                      CCR6      6q27
69  41165_g_at   immunoglobulin heavy constant mu                      IGHM      14q32.33
70  34435_at     aquaporin 9                                           AQP9      15q22.1
71  1679_at      postmeiotic segregation increased 2-like 6            PMS2L6    7q11-q22
72  41742_s_at   optineurin                                            OPTN      10p12.33
73  36998_s_at   spinocerebellar ataxia 2                              SCA2      12q24
74  39032_at     transforming growth factor beta-stimulated protein    TSC22     13q14
75  1065_at      fms-related tyrosine kinase 3                         FLT3      13q12
76  40584_at     nucleoporin 88kD                                      NUP88     17p13
77  41470_at     prominin-like 1 (mouse)                               PROML1    4p15.33
78  38470_i_at   amyloid beta precursor protein                        APPBP2    17q21-q23
79  37676_at     phosphodiesterase 8A                                  PDE8A     15q25.1
80  35449_at     killer cell lectin-like receptor B, member 1          KLRB1     12p13
81  36474_at     KIAA0776 protein                                      KIAA0776  6q16.3
82  32142_at     serine/threonine kinase 3 (STE20 homolog, yeast)      STK3      8q22.1
83  39299_at     KIAA0971 protein                                      KIAA0971  2q33.3
84  38252_s_at   1,6-glucosidase, 4-alpha-glucanotransferase           AGL       1p21
85  39246_at     stromal antigen 1                                     STAG1     3q22.3
86  160030_at    growth hormone receptor                               GHR       5p13-p12
87  33736_at     stomatin (EBP72)-like 1                               STOML1    15q24-q25
88  36014_at     hypothetical protein DKFZp564D0462                    DKFZP56   6q23.1
89  32072_at     mesothelin                                            MSLN      16p13.12

6. Additional explorations on Vxinsight clustering results with the Genetic
Algorithm K-Nearest Neighbor method (GA/KNN).

As previously mentioned, the Vxinsight clustering algorithm identified three
major groups, A, B, and C, in the infant leukemia dataset. We hypothesized
that these groups correspond to distinct biologic clusters, correlated with
unique disease etiologies. Several approaches were used to evaluate cluster
stability and to determine genes that discriminate between the clusters. To
test how well these three clusters can be distinguished using supervised
classification and cross-validation methods (49), we used a genetic algorithm
training methodology to perform feature selection with a simple K-nearest
neighbor classifier (50, 51). This approach was applied using the Vxinsight
cluster train/test class labels, creating three implied one-vs.-all
classification problems (A vs. B+C, etc.). The "top 50" discriminating gene
lists are reported for each problem and compared with the previously obtained
ANOVA gene lists.

To compare these "top 50" gene lists with the gene lists generated using
ANOVA, we used a one-vs-all-others (OVA) approach to form three binary
classification problems: a) A vs. BC; b) B vs. CA; c) C vs. AB. Based on our
subsequent numerical results (time to solution for the genetic algorithm),
Task (a) appears to have been the easiest and Task (b) the hardest. We also
performed a three-way classification of the Vxinsight groups, referred to as
Task (d).
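
The three one-vs-all-others tasks amount to relabeling the cluster assignments as binary targets. The following is a minimal Python sketch of that relabeling; the function names and the toy label list are illustrative assumptions and are not taken from the original analysis.

```python
def one_vs_all_labels(cluster_labels, positive):
    """Return 1 for samples in the `positive` cluster and 0 for all others."""
    return [1 if label == positive else 0 for label in cluster_labels]


def ova_tasks(cluster_labels):
    """Build one binary task per cluster, keyed as e.g. 'A vs. BC'."""
    classes = sorted(set(cluster_labels))
    tasks = {}
    for positive in classes:
        # The remaining clusters are pooled into a single negative class.
        rest = "".join(c for c in classes if c != positive)
        tasks[f"{positive} vs. {rest}"] = one_vs_all_labels(cluster_labels, positive)
    return tasks


# Hypothetical cluster assignments for seven samples.
labels = ["A", "B", "C", "B", "A", "C", "C"]
tasks = ova_tasks(labels)
```

Note that this sketch names the second task "B vs. AC" (alphabetical order) where the text writes "B vs. CA"; the grouping is the same.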

6.1. GA/KNN procedure and parallel program parameters

The Genetic Algorithm (GA) K-Nearest Neighbor (KNN) method (50, 51) is a
supervised feature selection method based on the non-parametric k-nearest
neighbor classification approach (52). A GA uses a direct analogy of natural
behavior and works with a "population" of "chromosomes." Each chromosome
represents a possible solution to a given problem and is assigned a fitness
score according to how good a solution it is. Highly fit individuals are given
opportunities to "reproduce" by "cross breeding" with other individuals in the
population. This produces new individuals (offspring), which share some
features taken from each parent. The least fit members of the population are
less likely to be selected for reproduction, and so die out. Selecting the
best individuals from the current "generation" and mating them to produce a
new set of individuals produces an entirely new population of possible
solutions. This new generation contains a higher proportion of the
characteristics possessed by the good members of the previous generation. In
this way, over many generations, good characteristics are spread throughout
the population, being mixed and exchanged with other good characteristics.
The fitness of each chromosome is determined by its ability to classify the
training set samples according to the KNN procedure. In KNN, each sample is
classified according to its k nearest neighbors, using the Euclidean distance
metric in d-dimensional space (d is the number of probesets in the expression
profile for a given patient sample). In our initial experiments, we chose
k=3. Under the consensus rule, if all of the k nearest neighbors of a sample
belong to the same class, the sample is classified as that class; otherwise,
the sample is considered unclassifiable. Under the majority rule, if more
than half of the k nearest neighbors of a sample belong to the same class,
the sample is classified as that class; otherwise, the sample is considered
unclassifiable.
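
The two voting rules can be illustrated with a short Python sketch. It is only an illustration of the consensus and majority rules as described above, not the C/MPI implementation; the function name `knn_classify` and the toy training points are assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(sample, train_x, train_y, k=3, rule="consensus"):
    """Classify `sample` from its k nearest training points (Euclidean distance).

    Returns the predicted class, or None when the sample is unclassifiable
    under the chosen rule.
    """
    ranked = sorted(
        ((math.dist(sample, x), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in ranked[:k])
    label, count = votes.most_common(1)[0]
    if rule == "consensus":      # all k neighbors must agree
        return label if count == k else None
    if rule == "majority":       # more than half of the k neighbors must agree
        return label if count > k / 2 else None
    raise ValueError("rule must be 'consensus' or 'majority'")


# Hypothetical two-dimensional training set.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "B", "B", "B", "B"]

mixed = (0.2, 0.2)   # nearest neighbors are labeled A, A, B
clean = (5, 5.5)     # nearest neighbors are labeled B, B, B
```

With k=3, the `mixed` sample is unclassifiable under the consensus rule but is assigned to class A under the majority rule, which is why the consensus rule is the stricter criterion.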

The GA/KNN methodology was implemented as a C/MPI parallel program on the
LosLobos Linux supercluster. The program terminates when 2000 good solutions
have been obtained. Following this initial processing, the frequency with
which each probeset was selected was analyzed.

The parameters used were as follows:

o Number of independent GA runs: 2000
o Number of generations/run: 1000
o Number of chromosomes in population: 100
o Number of genes in each chromosome: 20
o Number of neighbors (k) in KNN: 3
o KNN rules: consensus in training; majority in test
o Number of parallel compute nodes (2 processors/node): 26
o Number of master nodes: 1
o Number of slave processes: 50
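
A serial, much reduced Python sketch of the GA/KNN loop and the subsequent selection-frequency count is given below. It is not the C/MPI program; the selection scheme (keep the fitter half), the crossover and mutation operators, the mutation rate of 0.3, the small default parameters, and the toy data are all assumptions chosen to keep the example short.

```python
import math
import random
from collections import Counter


def consensus_fitness(genes, X, y, k=3):
    """Leave-one-out fraction of samples whose k nearest neighbors all share their class."""
    hits = 0
    for i, xi in enumerate(X):
        neighbors = sorted(
            ((math.dist([xi[g] for g in genes], [xj[g] for g in genes]), y[j])
             for j, xj in enumerate(X) if j != i),
            key=lambda pair: pair[0],
        )[:k]
        hits += all(label == y[i] for _, label in neighbors)
    return hits / len(X)


def ga_run(X, y, n_genes, pop_size=20, generations=30, target=1.0, seed=0):
    """One GA run; returns (fitness, chromosome) for the best chromosome found."""
    rng = random.Random(seed)
    n_features = len(X[0])
    population = [rng.sample(range(n_features), n_genes) for _ in range(pop_size)]
    best = (-1.0, [])
    for _ in range(generations):
        scored = sorted(((consensus_fitness(c, X, y), c) for c in population),
                        key=lambda pair: pair[0], reverse=True)
        best = scored[0]
        if best[0] >= target:                      # a "good solution" ends the run
            break
        parents = [c for _, c in scored[:pop_size // 2]]   # least fit die out
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = list(dict.fromkeys(a + b))      # distinct genes of both parents
            child = rng.sample(pool, n_genes)      # crossover
            unused = [g for g in range(n_features) if g not in child]
            if unused and rng.random() < 0.3:      # point mutation (assumed rate)
                child[rng.randrange(n_genes)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return best


def selection_frequencies(X, y, n_genes, runs=5):
    """Count how often each feature index appears in the best chromosome of each run."""
    counts = Counter()
    for seed in range(runs):
        _, genes = ga_run(X, y, n_genes, seed=seed)
        counts.update(genes)
    return counts


# Toy data: feature 0 separates the two classes, features 1-3 are noise.
X = [[0.0, 0.3, 0.9, 0.1], [0.2, 0.8, 0.1, 0.7], [0.1, 0.5, 0.4, 0.9], [0.3, 0.1, 0.6, 0.2],
     [10.0, 0.4, 0.8, 0.3], [10.2, 0.9, 0.2, 0.6], [10.1, 0.2, 0.5, 0.8], [10.3, 0.6, 0.7, 0.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
freq = selection_frequencies(X, y, n_genes=2, runs=5)
```

In the analysis described here, the corresponding frequency counts over 2000 solutions of 20 probesets each are what the "top 50" gene lists are drawn from.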

6.2. Methods

1) Select predictor probesets

Using the Vxinsight cluster labels, we applied the GA/KNN methodology to
select the top 50 discriminating probesets from the initial list of 8446
probesets for each task. The consensus rule was used here.

2) Compare with Vxinsight cluster-characterizing genes

The Vxinsight clustering algorithm identified 126 cluster-characterizing
genes for each task according to the F values in ANOVA. The lists include the
top up-regulated and down-regulated genes. We compared them with our
predictor probesets.

3) Evaluate classifier performance

Both leave-one-out cross validation (LOOCV) and evaluation on an independent
test set were used to evaluate classifier performance for the Vxinsight
clusters. Note that we have made no attempt at this stage to optimize (using
the training set only, and blinded to the test set) the number of features
selected for the final out-of-sample test set evaluation. LOOCV was based on
the consensus rule, and prediction for the test dataset on the majority rule.

4) Statistical significance analysis

The statistical significance of the predictions was calculated. We tested
whether the Success Rate (SR) was larger than 0.5 and whether the Odds Ratio
(OR=TP/FP) was larger than 1.
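
The text does not state which statistical test was used. As one possible reading, the Python sketch below computes the SR and the OR as defined above and applies an exact one-sided binomial test to the hypothesis SR > 0.5; the choice of test, the function names, and the confusion-matrix counts are assumptions made for illustration, and no test of OR > 1 is shown.

```python
import math


def success_rate(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def odds_ratio(tp, fp):
    """OR as defined in the text: true positives divided by false positives."""
    return tp / fp if fp else math.inf


def binomial_p_value(successes, n, p0=0.5):
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
        for i in range(successes, n + 1)
    )


# Hypothetical test-set counts.
tp, tn, fp, fn = 8, 9, 1, 2
sr = success_rate(tp, tn, fp, fn)
p_sr = binomial_p_value(tp + tn, tp + tn + fp + fn)
```

A small `p_sr` indicates that a success rate this high would be unlikely if the classifier were guessing at random between two classes.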
6.3. Results

1) Top gene selections: Z-score plots were computed from gene selection
frequencies in the GA (see (50, 51) for details). A gene with a very high
Z-score, "40103_at", was found for cluster B vs. CA and C vs. AB.

2) Top gene lists: Tables 58 (A vs. BC), 59 (B vs. CA) and 60 (C vs. AB)
show the overlap with the 'up'- and 'down'-regulated gene lists in the infant
cohort as indicated. The numbers of overlapping genes between the
cluster-characterizing genes and our top 50 genes are 20, 17, and 17 for the
A vs. BC, B vs. CA, and C vs. AB tasks, respectively.

3) Evaluating the performance of a classifier

See Table 61. Here pVal1 is the p-value for testing whether the SR is larger
than 0.5 and pVal2 is the p-value for testing whether the OR is larger than
1. Both the pVal1s and the pVal2s are very small (<<0.05) for our
predictions, so the predictions are significant.

4) Classification results with DIFF genes

The numbers of DIFF calls among the top 50 discriminating genes are 46, 32,
and 36 for A vs. BC, B vs. CA, and C vs. AB, respectively. We performed
classification based only on the DIFF genes for each of the three tasks.
Unfortunately, no improvement of the SRs was observed for the test dataset
(Table 62).

Gene description

G antigen 2
myosin IXB
trans-golgi network protein 2
ephrin-A3
activin A receptor, type IB
5-oxoprolinase (ATP-hydrolysing)
protease, serine, 16 (thymus)
ribosomal protein S6 kinase, 70kD, polypeptide 2
serine (or cysteine) proteinase inhibitor, clade G (C1 inhibitor), member 1
contactin 2 (axonal)
calcium/calmodulin-dependent protein kinase (CaM kinase) II gamma
splicing factor, arginine/serine-rich 4
cyclin F
coagulation factor II (thrombin) receptor-like 3
CD34 antigen
ADP-ribosyltransferase (NAD+; poly(ADP-ribose) polymerase)-like 2
rhomboid (veinlet, Drosophila)-like
signal transducer and activator of transcription 1, 91kD
erythropoietin receptor
KIAA1048 protein
thioredoxin interacting protein
KIAA0274 gene product
integrin, alpha 3 (antigen CD49C, alpha 3 subunit of VLA-3 receptor)
ubiquitin-conjugating enzyme E2I (homologous to yeast UBC9)
ES1 (zebrafish) protein, human homolog of
hemoglobin, delta
cytochrome c oxidase subunit VIa polypeptide 1
ubiquitin specific protease 4 (proto-oncogene)
KIAA0870 protein
ortholog of mouse integral membrane glycoprotein LIG-1
glutamate dehydrogenase 1
Homo sapiens mRNA; cDNA DKFZp586F1322 (from clone DKFZp586F1322)
5-hydroxytryptamine (serotonin) receptor 1B

Gene description

KIAA1719 protein
bridging integrator 1
carboxylesterase 1 (monocyte/macrophage serine esterase 1)
ribosomal protein S3A
CCAAT/enhancer binding protein (C/EBP), delta

Table 61: Statistical significance of the prediction for Vxinsight clusters

Table 62: OVA classification results for Vxinsight clusters (only with DIFF genes)

EXAMPLE XIV

Heterogeneity of Gene Expression Profiles in MLL-Associated Infant
Leukemia: Identification of Distinct Expression Profiles and Novel
Therapeutics Targets

Summary

Translocations involving the MLL (ALL-1, HRX, Htrx-1) gene at chromosome band
11q23 are the most common cytogenetic abnormality seen in infant leukemia.
While there is evidence that MLL-associated chromosomal rearrangements carry
a poorer prognosis, the pathogenesis and unique gene expression for each MLL
rearrangement remain largely undefined. Using oligonucleotide arrays
(Affymetrix U95Av2) and both unsupervised and supervised analysis methods, we
derived comprehensive gene expression profiles from a retrospective cohort of
126 infant cases registered to NCI-sponsored clinical trials. Fifty-three of
those cases had MLL rearrangements with several partner genes (AF4, ENL,
AF10, AF9 and AF1Q). We used class identification methods (Bayesian networks,
Support Vector Machines and Discriminant Analysis) to determine genes with
common patterns of expression across all the MLL cases, as well as genes that
were uniquely expressed in, and distinguished, each MLL translocation
variant. However, class discovery tools suggested that the MLL-associated
profiles were quite heterogeneous among different translocation variants and
were dominated by three differential expression patterns. Interpretation of
our data indicated that infant MLL is an entity comprising several intrinsic
biologic classes not precisely predicted by current standards of morphology,
immunophenotyping, or cytogenetics. Consideration of such class membership
could improve classification schemes and reveal potential therapeutic targets
for MLL-associated leukemias.

Introduction

In Example XIII, we analyzed the gene expression profiles in samples from
126 infant acute leukemia patients. Three inherent biologic subgroups were
identified. These groups were not well defined by traditional cell type (AML
vs. ALL) or cytogenetic (MLL vs. not) labels. Instead, they reflected
different etiologic events with biological and clinical relevance. The
distribution of the MLL infant cases among those "etiology-driven" clusters
is the focus of this study.

Materials and Methods

For this study we analyzed 126 diagnostic bone marrow samples from patients
with acute leukemia who were aged < 1 year at diagnosis. In each case, the
percentage of blasts was >80%. The cohort was designed from cases registered
to NCI-sponsored Infant Oncology Group/Children's Oncology Group treatment
trials number 8398, 8493, 8821, 9107, 9407 and 9421. Of the 126 cases, 78
(62%) were acute lymphocytic leukemia (ALL) and 48 (38%) were acute myeloid
leukemia (AML) by standard morphological and immunophenotypic criteria.
Fifty-three (42%) cases had translocations involving the MLL gene (chromosome
segment 11q23). An average of 2 x 10^ cells were used for total RNA
extraction with the Qiagen RNeasy mini kit (Valencia, CA). The yield and
integrity of the purified total RNA were assessed using the RiboGreen assay
(Molecular Probes, Eugene, OR) and the RNA 6000 Nano Chip (Agilent
Technologies, Palo Alto, CA), respectively. Complementary RNA (cRNA) target
was prepared from 2.5 µg total RNA using two rounds of Reverse Transcription
(RT) and In Vitro Transcription (IVT). Following denaturation for 5 min at
70°C, the total RNA was mixed with 100 pmol T7-(dT)24 oligonucleotide primer
(Genset Oligos, La Jolla, CA) and allowed to anneal at 42°C. The mRNA was
reverse transcribed with 200 units Superscript II (Invitrogen, Grand Island,
NY) for 1 hr at 42°C. After RT, 0.2 vol 5X second strand buffer, additional
dNTP, 40 units DNA polymerase I, 10 units DNA ligase, and 2 units RNase H
(Invitrogen) were added and second strand cDNA synthesis was performed for
2 hr at 16°C. After addition of T4 DNA polymerase (10 units), the mix was
incubated an additional 10 min at 16°C. An equal volume of
phenol:chloroform:isoamyl alcohol (25:24:1) (Sigma, St. Louis, MO) was used
for enzyme removal. The aqueous phase was transferred to a microconcentrator
(Microcon 50, Millipore, Bedford, MA) and washed/concentrated with 0.5 ml
DEPC water until the sample was concentrated to 10-20 µl. The cDNA was then
transcribed with T7 RNA polymerase (Megascript, Ambion, Austin, TX) for 4 hr
at 37°C. Following IVT, the sample was phenol:chloroform:isoamyl alcohol
extracted, washed and concentrated to 10-20 µl. The first round product was
used for a second round of amplification, which utilized random hexamer and
T7-(dT)24 oligonucleotide primers, Superscript II, two RNase H additions,
DNA polymerase I plus T4 DNA polymerase, and finally a biotin-labeling high
yield T7 RNA polymerase kit (Enzo Diagnostics, Farmingdale, NY). The
biotin-labeled cRNA was purified on Qiagen RNeasy mini kit columns, eluted
with 50 µl of 45°C RNase-free water and quantified using the RiboGreen assay.

Following a quality check on Agilent Nano 900 Chips, 15 µg cRNA were
fragmented following the Affymetrix protocol (Affymetrix, Santa Clara, CA).
The fragmented RNA was then hybridized for 20 hours at 45°C to HG_U95Av2
probes. The hybridized probe arrays were washed and stained with the
EukGE_WS2 fluidics protocol (Affymetrix), including streptavidin
phycoerythrin conjugate (SAPE, Molecular Probes, Eugene, OR) and an antibody
amplification step (Anti-streptavidin, biotinylated, Vector Labs,
Burlingame, CA). HG_U95Av2 chips were scanned at 488 nm, as recommended by
Affymetrix. The expression value of each gene was calculated using
Affymetrix Microarray Suite 5.0 software.

Data Presentation and Exclusion Criteria

Criteria used as quality controls included: total RNA integrity, cRNA
quality, array image inspection, B2 oligo performance, and internal control
genes (GAPDH value greater than 1800). Of the initial cohort of 142 infant
acute leukemia cases, 126 were ultimately included in this study.

Data Analysis

Affymetrix MAS 5.0 statistical analysis software was used to process the raw
microarray image data for a given sample into quantitative signal values and
associated present, absent or marginal calls for each probe set. A filter was
then applied which excluded from further analysis all Affymetrix "control"
genes (probe sets labelled with the AFFX_ prefix), as well as any probe set
that did not have a "present" call in at least one of the samples. This
filtering step reduced the number of probe sets from 12625 to 8414, resulting
in a matrix of 8,414 x 126 signal values. Our Bayesian classification and
Vxinsight clustering analyses omitted this step, choosing instead to assume
minimal a priori gene selection, as described in Helman et al., 2002 and
Davidson et al., 2001.
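
The probe-set filter described above can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout, the single-letter call codes ("P", "A", "M"), and the example probe identifiers are assumptions, and the prefix test uses "AFFX" so that it matches control identifiers whichever separator follows the prefix.

```python
def filter_probe_sets(detection_calls):
    """Keep probe sets that are not controls and are 'present' in at least one sample.

    `detection_calls` maps a probe-set identifier to its list of per-sample
    detection calls ("P" = present, "A" = absent, "M" = marginal).
    """
    kept = []
    for probe_id, calls in detection_calls.items():
        if probe_id.startswith("AFFX"):   # control probe sets are excluded
            continue
        if "P" not in calls:              # never called present in any sample
            continue
        kept.append(probe_id)
    return kept


# Hypothetical detection calls for three probe sets across three samples.
calls = {
    "AFFX-BioB-5_at": ["P", "P", "P"],
    "1000_at": ["A", "P", "M"],
    "1001_at": ["A", "A", "M"],
}
kept = filter_probe_sets(calls)
```

Here only "1000_at" survives: the first probe set is a control, and the third is never called present.
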
The first stage of our analysis consisted of a series of binary
classification problems defined on the basis of clinical and biologic labels.
The nominal class distinctions were ALL/AML, MLL/not-MLL, and achieved
complete remission (CR)/not-CR. Additionally, several derived classification
problems were considered based on restrictions of the full cohort to
particular subsets of the data (such as the Vxinsight clusters). The
multivariate supervised learning techniques used included Bayesian nets
(Helman et al., 2002) and support vector machines (Guyon et al., 2002). The
performance of the derived classification algorithms was evaluated using
fold-dependent leave-one-out cross validation (LOOCV) techniques. These
methods allowed the identification of genes associated with remission or
treatment failure and with the presence or absence of translocations of the
MLL gene across the dataset.

In order to identify potential clusters and inherent biologic groups, a large
number of clinical co-variables were correlated with the expression data
using unsupervised clustering methods such as hierarchical clustering,
principal component analysis and a force-directed clustering algorithm
coupled with the Vxinsight visualization tool. Agglomerative hierarchical
clustering with average linkage (similar to Eisen et al., 1998) was performed
with respect to both genes and samples, using the MATLAB (The Mathworks,
Inc.) MatArray toolbox as well as the native MATLAB statistics toolbox. The
data for a given gene were first normalized by subtracting the mean
expression value computed across all patients and dividing by the standard
deviation. The distance metric used for the hierarchical clustering was one
minus Pearson's correlation coefficient. This metric was chosen to enable
subsequent direct comparison with the Vxinsight cluster analysis, which is
based on the t-statistic transformation of the correlation coefficient
(Davidson et al., 2001).
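
The per-gene normalization and the correlation-based distance can be written out as a short Python sketch. The original work used MATLAB toolboxes; the function names here are illustrative, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
from statistics import mean, stdev


def zscore(values):
    """Subtract the mean across patients and divide by the standard deviation."""
    m, s = mean(values), stdev(values)   # sample standard deviation (assumed)
    return [(v - m) / s for v in values]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Distance used for the hierarchical clustering: one minus Pearson's r."""
    return 1.0 - pearson(x, y)


normalized = zscore([1.0, 2.0, 3.0])
d_same = correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_opposite = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The distance is near 0 for perfectly correlated profiles and near 2 for perfectly anti-correlated ones, so it ranks profiles by the shape of their expression pattern and not by absolute signal level.
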
The second clustering method was a particle-based algorithm implemented
within the Vxinsight knowledge visualization tool. In this approach, a matrix
of pair similarities is first computed for all combinations of patient
samples. The pair similarities are given by the t-statistic transformation of
the correlation coefficient determined from the normalized expression
signatures of the samples (Davidson et al., 2001). The program then randomly
assigns patient samples to locations (vertices) on a two-dimensional graph,
and draws lines (edges) linking each sample pair, assigning each edge a
weight corresponding to the pairwise t-statistic of the correlation. The
resulting two-dimensional graph constitutes a candidate clustering. To
determine the optimal clustering, an iterative annealing procedure is
followed. In this procedure a 'potential energy' function that depends on
edge distances and weights is minimized by following random moves of the
vertices (Davidson et al., 1998, 2001). Once the 2D graph has converged to a
minimum energy configuration, the clustering defined by the graph is
visualized as a 3D terrain map, where the vertical axis corresponds to the
density of samples located in a given 2D region. The resulting clusters are
robust with respect to random starting points and to the addition of noise to
the similarity matrix, evaluated through effects on neighbour stability
histograms (Davidson et al., 2001).
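
The text does not give the formula for the t-statistic transformation. The Python sketch below uses the standard transformation for testing a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)), with n the number of values in each profile; that this is the exact form used by Vxinsight is an assumption, and the function name is illustrative.

```python
import math


def correlation_t_statistic(r, n):
    """Standard t-statistic for a Pearson correlation r computed from n paired values."""
    if n <= 2:
        raise ValueError("need more than two paired values")
    if abs(r) >= 1.0:
        # A perfect correlation maps to an unbounded similarity.
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical correlation of 0.6 computed over 18 values.
t = correlation_t_statistic(0.6, 18)
```

Unlike the raw correlation, this similarity grows with the number of values supporting it, so the same r counts for more when it is computed over a longer profile.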

Results

Expression profiling demonstrates heterogeneity across infant MLL cases

To determine the variations in gene expression profiles of infant leukemia
cases involving different MLL rearrangements, 126 infant leukemia cases
registered to NCI-sponsored Infant Oncology Group/Children's Oncology Group
treatment trials were studied using oligonucleotide microarrays containing
12,625 probe sets (Affymetrix U95Av2 array platform). Of the 126 cases,
fifty-three (42%) had translocations involving the MLL gene (chromosome
segment 11q23). The distribution of the MLL cytogenetic abnormalities across
this data set is provided in Table 63.



Table 63. Distribution of MLL Cytogenetic Abnormalities in the Infant Cohort

MLL Translocation    Total # of Cases in Infant Cohort    ALL    AML
t(4;11)              29                                   28     1
t(11;19)             9                                    7      2
t(10;11)             4                                    2      2
t(1;11)              4                                    2      2
t(9;11)              4                                    1      3
Other MLL            3                                    1      2
Not MLL              42                                   26     16
Unknown              31                                   11     20


The initial examination of the data was accomplished using the force-directed
clustering algorithm coupled with the visualization tool (Davidson et
al., 1998; 2001). When applied to the infant cohort, this particle-based
clustering algorithm demonstrated the existence of three well-separated groups
of patients that displayed similar patterns of gene expression (Fig. 10). These
major clusters were statistically robust and internally consistent as demonstrated
by linear discriminant analysis with fold-dependent leave-one-out
cross-validation (LOOCV). Further analysis demonstrated that the clusters could not
be completely explained by the traditional diagnostic parameters (morphology:
ALL vs. AML, or cytogenetics: MLL rearrangement vs. not), implying that the
intrinsic biology may not be driven by these variables.

Further analysis suggested an association between the three clusters and
different leukemogenic mechanisms (previously submitted data), called
hereafter "stem cell-like", "lymphoid" and "myeloid/environmental". MLL
cases were seen in each of the mentioned patient clusters (Fig. 13). The MLL
cases in the "stem cell-like" cluster (Cluster A, n=20) were primarily t(4;11)
(n=7), with two cases of t(10;11) and one of t(11;19). The "lymphoid"
cluster (Cluster B, n=52) included only one AML case and contained a large
number of t(4;11) cases (n=21) as well as four cases with t(11;19), one case
with t(10;11), and one case with t(1;11). Finally, the "myeloid" cluster (Cluster
C, n=54) was predominantly AML but contained twelve cases with an ALL
label that nonetheless had a more "myeloid" pattern of gene expression. This
cluster included some MLL cases with t(4;11), all the t(9;11), some t(11;19),
and t(X;11). It has been suggested that, in contrast to ALL, AML patients with
MLL rearrangements do not tend to co-express lymphoid- and myeloid-associated
antigens simultaneously on leukemic blasts and have outcomes
similar to those without the gene rearrangements (Tien, 2000). Our data
support this view, since roughly the same frequencies of long-term remission
(30%) and failures (70%) were observed in the "myeloid" cluster in patients
irrespective of MLL translocations.

An important finding of the present study is that two very distinct groups
of gene expression profiles could be identified across cases with the same
t(4;11) rearrangement (VxInsight clusters A and B). Using ANOVA, a gene list
that characterizes the t(4;11) groups within the infant clusters A and B was
derived (Fig. 15). There is a considerable degree of overlap between the cluster
A-characterizing genes and those that distinguish the t(4;11) cases in this group
(previously submitted data). Cluster A was typified by genes of particular
interest in signal transduction (EFNA3, B7 protein, Cytokeratin type II, latent
transforming growth factor beta binding protein 4, Contactin 2 axonal, and
Erythropoietin receptor precursor), transcription regulation (Integrin α3
(ITGA3), Ataxin 2 related protein (A2LP) and Heat-shock transcription factor
4 (HSF4)) and cell-to-cell signaling (Myosin-binding protein C slow-type).
Although most useful in the separation of the cluster A cases, these genes seem
to be separating the t(4;11) cases in this group as well.
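
The ANOVA-based derivation of a gene list can be sketched as follows. This is a minimal illustration only, assuming a one-way ANOVA F statistic computed per gene across cluster labels and a simple ranking by F; it is not the VxInsight implementation, and the gene names and expression values are hypothetical placeholders.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of expression values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares around each group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(expression, labels):
    """Return gene names sorted by decreasing F statistic across the label groups."""
    scores = {}
    for gene, values in expression.items():
        by_label = {}
        for value, label in zip(values, labels):
            by_label.setdefault(label, []).append(value)
        scores[gene] = f_statistic(list(by_label.values()))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: six t(4;11) cases, three in cluster A and three in cluster B.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene_1": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],  # differs strongly between clusters
    "gene_2": [2.0, 3.0, 2.5, 2.4, 2.1, 3.1],  # similar in both clusters
}
ranking = rank_genes(expression, labels)
```

Genes at the top of such a ranking are those whose expression differs most between the two groups relative to the variation within each group.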

Gene expression patterns of different MLL translocations

The second method used in our analysis was aimed at uncovering sets of
genes that characterized each one of the MLL translocations. The process of
defining the best set of discriminating genes was accomplished using supervised
learning techniques such as Bayesian Networks, Linear Discriminant Analysis
and Support Vector Machines (SVM) (reviewed in Orr, 2002). In contrast with
unsupervised methods, supervised learning methods learn "known classes",
creating classification algorithms that may uncover interesting and novel
therapeutic targets. Our characterization, accomplished using supervised
learning techniques, of the gene expression profiles per MLL variant and of the
genes involved in these translocations is shown in Fig. 16. These genes represent
novel diagnostic and therapeutic targets for MLL-associated leukemias.

Gene expression profiles characteristic of the t(4;11) and other MLL
translocations are shown in Figs. 17 and 18 (Fig. 17: Bayesian Network
analysis, Support Vector Machines analysis, Fuzzy Logic and Discriminant
Analysis; Fig. 18: ANOVA from the VxInsight program). The different methods
allowed the classification of unknown samples within each of the groups with
accuracy rates higher than 90%, as calculated by fold-dependent leave-one-out
cross-validation. This data analysis of gene expression conditioned on karyotype
generated distinct case clustering, supporting the view that unique gene expression
"signatures" identify defined genetic subsets of infant leukemia. This confirms
recently published data (Armstrong et al., 2002), which revealed that the MLL
infant leukemia cases are characterized by specific gene expression profiles.
However, while groups of genes uniquely associated with the MLL cases can be
identified using supervised learning techniques, infant MLL leukemia seems to
be an entity comprised of several intrinsic biologic clusters not precisely
predicted by current standards of morphology, immunophenotyping, or
cytogenetics.
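
The leave-one-out accuracy estimate referred to above can be illustrated with a minimal sketch. A nearest-centroid classifier is used here purely as a simple stand-in for the Bayesian Network, SVM and discriminant analysis methods named in the text; the function names and the two-gene expression values are hypothetical. In a fold-dependent procedure, any gene selection would also be repeated inside each fold using the training samples only, which this sketch omits.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Predict the label whose class centroid is closest (squared Euclidean) to query."""
    groups = {}
    for row, label in zip(train_x, train_y):
        groups.setdefault(label, []).append(row)
    best_label, best_dist = None, float("inf")
    for label, rows in groups.items():
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((q - c) ** 2 for q, c in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(x, y):
    """Hold out each sample once, train on the rest, return the fraction predicted correctly."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)

# Hypothetical two-gene expression values for six samples in two classes.
x = [[1.0, 0.9], [1.2, 1.1], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
y = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(x, y)
```

Because each held-out sample never contributes to the centroids used to classify it, the resulting fraction is an estimate of accuracy on unseen samples rather than of fit to the training set.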

Expression levels of FLT3 across various MLL translocations 

Expression levels of the FMS-related tyrosine kinase 3 (FLT3) gene
were analyzed across different MLL translocations. FLT3, a member of the
receptor tyrosine kinase (RTK) class III, is preferentially expressed on the
surface of a high proportion of acute myeloid leukemia (AML) and B-lineage
acute lymphocytic leukemia (ALL) cells in addition to hematopoietic stem cells,
brain, placenta and liver (Kiyoe, 2002). Within MLL subgroups, FLT3 expression
is variable. The expression levels for this gene were differentially higher in
t(4;11), t(11;19), t(9;11) and other MLL translocations (Fig. 14). However,
MLL subgroups such as t(1;11) and t(10;11) had FLT3 expression similar to
that of non-MLL cases, suggesting that the various MLL translocations may
exert differential influence on the FLT3 expression levels. This may lend
support to previously raised concerns about potential problems in the clinical use of
FLT3 inhibitors for leukemia treatment (Gilliland et al., 2002).

Discussion 

Gene expression profiling of our infant MLL leukemia cases revealed
new insights into infant leukemia classification that may increase our
understanding of the pathogenesis of, and hence treatment options for, this disease.
While groups of genes uniquely associated with each MLL translocation
variant can be identified using supervised learning techniques (as previously
shown by others), infant acute MLL leukemia seems to be an entity comprised
of several intrinsic biologic clusters not precisely predicted by current standards
of morphology, immunophenotyping, or cytogenetics. Unsupervised analysis
demonstrated that gene expression in specific MLL rearrangements varied
significantly amongst the three infant groups. As these intrinsic clusters
appeared to relate to distinct subtypes of infant leukemia, the various MLL
translocations may represent a critical secondary transforming event for each
biological group, conferring more defined tumor phenotypes. Alternatively,
MLL translocations may be permissive for further genetic rearrangements that
will strongly influence and define differential gene expression patterns.

Our findings of heterogeneity of gene expression within and between MLL
subtypes differ from previous reports suggesting more homogeneous gene
expression (Armstrong, 2002). This probably reflects mainly the larger number
of cases available to us for analysis. However, rigorous exclusion of
unsatisfactory samples was also critical for the successful interpretation of the
data.



Although particular genes can be selected by supervised methods as
characterizing cases with MLL translocations, in the current study the presence
or absence of MLL rearrangements did not define a distinct leukemia class
during unsupervised learning analysis of the gene expression patterns of these
infant patients. Despite the fact that supervised analysis of the microarray data
can successfully segregate patients defined by traditional methods such as
immunophenotyping and cytogenetics, results from these techniques are most
useful in the identification of unanticipated similarities and diversities in
individual patients and thus may be useful in augmenting risk-group
stratification in the future. Further studies to enhance the ability to classify
infant MLL subtypes according to shared pathways of leukemic transformation
will have important implications for the development of new therapeutic
approaches.